Systems on silicon show a continuous increase in complexity due to the ever
increasing need for implementing new features and improvements of existing
functions. This is enabled by the increasing density with which components can be
integrated on an integrated circuit. At the same time the clock speed at which circuits
are operated tends to increase too. The higher clock speed in combination with the
increased density of components has reduced the area which can operate
synchronously within the same clock domain. This has created the need for a modular
approach. According to such an approach the processing system comprises a plurality
of relatively independent, complex modules. In conventional processing systems the
modules usually communicate with each other via a bus. As the number of
modules increases, however, this way of communication is no longer practical for the
following reasons. On the one hand the large number of modules forms too high a bus
load. On the other hand the bus forms a communication bottleneck as it enables only
one device at a time to send data to the bus. A communication network forms an
effective way to overcome these disadvantages. The communication network comprises
a plurality of partly connected nodes. Messages from a module are redirected by the
nodes to one or more other nodes. To that end the message comprises first information
indicative of the location of the addressed module(s) within the network. The message
may further include second information indicative of a particular location within the
module, such as a memory or register address. The second information may invoke
a particular response of the addressed module.

It is an object of the invention to provide an integrated circuit and a method 
according to the introductory paragraph, which provides the modules therein a 
relatively simple way of issuing messages. 

In order to achieve said object the integrated circuit is characterized by the 
characterizing portion of claim 1. 

In the integrated circuit according to the invention modules can issue messages in a
simple way, by using a single address. This makes it possible for a module to perform
a write action to a particular memory address without being aware of the destination
module in which said address is located.

In this way the network appears to the module issuing the message as a bus. This
makes it relatively simple to incorporate already existing modules designed for a
bus-like architecture in an integrated circuit according to the invention.

As such, processing systems are known where a processor is coupled via a bus to
various memories, each of which is mapped onto a respective portion of the total
address range. By way of example a ROM and a RAM may be mapped to a first and a
second address range respectively. When the processor performs a read instruction,
the address in the instruction at the same time defines which memory is selected to
read the data from.

In such known processing systems each of the various modules, such as memories, is
directly coupled to the bus. In the integrated circuit according to the invention,
selecting one of the modules implies that the other memories are set in a state
wherein they do not interfere with the bus traffic. Apart from the memory that is
addressed no other module is required to perform an action (in fact, they don't have to
and don't need to know that another module is active, i.e. they don't have to be 'set
in a state'), or 2) that multiple concurrent and/or pipelined messages can be active
simultaneously in the network as a whole. In an integrated circuit according to the
invention, however, information issued by the active module is transferred as a
message via one or more nodes of the network. As a consequence it follows a
different route through the network depending on the address. This route is scheduled
by the network.

Examples of the two pieces of information that are arranged as a single address are:

1) Single logical memory space/map/range mapped to multiple distributed memories,
each with their own physical memory ranges.

2) Virtual memory space mapped to a single logical memory space (distributed or not).

3) Multiple memory spaces/maps/ranges mapped to multiple distributed memories.

For 2) and 3) two translations may take place (vm -> logical -> physical, and
multiple -> single -> physical).
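
The first example above (one logical address range spread over several memories) can be illustrated with a minimal sketch. It is not taken from the claims; the region table, the module names "rom" and "ram", and the function name decode are assumptions made purely for illustration. The sketch shows how a single address issued by a module could be split into the first information (which module) and the second information (the location inside that module).

```python
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    base: int    # first logical address covered by this region
    size: int    # number of addressable locations in the region
    module: str  # hypothetical name of the destination module


# Hypothetical address map: one logical space spread over two memories.
ADDRESS_MAP = [
    Region(base=0x0000, size=0x4000, module="rom"),
    Region(base=0x4000, size=0x8000, module="ram"),
]


def decode(address: int) -> Tuple[str, int]:
    """Split a single logical address into (module, address local to that module)."""
    for region in ADDRESS_MAP:
        if region.base <= address < region.base + region.size:
            return region.module, address - region.base
    raise ValueError(f"address {address:#x} is not mapped to any module")
```

The issuing module only ever sees the logical address; a lookup of this kind would be performed by the network side, which is what lets the network appear as a bus.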

The integrated circuit of claim 3 and the method of claim 4 provide another way of 
improving data transfer in an integrated circuit comprising a plurality of modules 
connected by a network. 

Theoretically a transaction could comprise any number of outgoing and/or return 
messages. In practice however a transaction is made up of one or two outgoing 
messages (from the first to the second module), and zero, one, or two return messages 
(from the second to the first module). By managing the outgoing messages in a way 
different from the return messages the overall efficiency of the network and therewith 
the integrated circuit comprising the network is improved. This is further illustrated 
with the following embodiments. 

With reference to claim 5 it is remarked that GT connections can overbook resources 
in some cases. For example, when an ANIP opens a GT read connection, it must 
reserve slots for the read command messages, and for the read data messages. The 
ratio between the two can be very large (e.g., 1:100), which leads either to large slot
tables, or bandwidth being wasted for the read command messages. In order to 
prevent as much as possible that a reservation for guaranteed traffic would impede 
other transactions the bandwidth which can be reserved should be restricted. On the 
other hand the best effort traffic may use any resources which are currently available. 
As a consequence guaranteed traffic has bounded but on average higher latency than 
best-effort traffic which has no fixed upper bound, but is (or should be) faster on 
average. 

Based on this recognition it has been found that the overall quality of the network 
transport could be improved by exploiting BE packets for read command 
messages, and GT packets for read data messages. No guarantees can be offered in 
this case, but the overall throughput can be higher and more stable than in the case of 
using only BE packets. 
With reference to claim 6 it is remarked that preferably the outgoing transactions are
handled in a locally ordered and the return transactions in a globally ordered
transaction mode. The one or more addressed modules process the transactions in the
order they have been issued, and the return parts of the transactions are all delivered to
the first module in the order in which it initiated the transactions. Even if ordered
channels are used, the responses from different addressed modules (e.g., in a
narrowcast connection) must be sorted at the first module. This kind of ordering
conforms with AMBA.

To implement global ordering, transactions that are delivered to different second
modules (also referred to as slaves) must be ordered exactly as they were sent by the
first module (also referred to as master). This means that the network should
have a global time indicator and use e.g. deadline-based scheduling in the network,
while in addition assumptions on the consumption time of the second modules must be
available. An alternative way to introduce global ordering is to introduce explicit
dependencies between transactions. The latter can be done by using
acknowledged/tagged transactions, where proof of delivery to the slave is sent back to
the master using an acknowledgement message. This solution, however, introduces
extra latency because transactions are sequentialised with a round-trip delay/latency
per transaction (send a message, wait for the acknowledgement, send the next message,
wait for the next acknowledgement, etc.). By requiring only a local ordering for the
delivery of the outgoing transactions, the slaves, provided that they are autonomous
(which is usually the case), can execute messages independently.
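
The sorting of responses at the first module, mentioned above for globally ordered return transactions, can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the sorting is done, and the use of per-transaction sequence numbers as well as the class name ReorderBuffer are hypothetical.

```python
class ReorderBuffer:
    """Releases responses to the master in the order the transactions were issued."""

    def __init__(self):
        self._next_issue = 0    # sequence number given to the next transaction
        self._next_deliver = 0  # sequence number the master expects next
        self._pending = {}      # responses that arrived early, keyed by sequence number

    def issue(self) -> int:
        """Register a new outgoing transaction and return its sequence number."""
        seq = self._next_issue
        self._next_issue += 1
        return seq

    def receive(self, seq: int, response) -> list:
        """Store a response; return every response that can now be delivered in order."""
        self._pending[seq] = response
        ready = []
        while self._next_deliver in self._pending:
            ready.append(self._pending.pop(self._next_deliver))
            self._next_deliver += 1
        return ready
```

With locally ordered outgoing transactions the slaves still run independently; only the return side pays for the global order, in the form of buffer space for early responses.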

With reference to claim 7 it is remarked that in this way buffer space is used in
an efficient way. A particular example is an embodiment wherein a large buffer
space is reserved for the buffer of the network interface coupled to an active
module, such as a module issuing a read command, and a small buffer space is
reserved for the buffer of the network interface coupled to a passive module, e.g.
the one receiving the read message.

In other situations there may be different types of flow control (e.g. you never want
to lose write commands, but don't mind losing read data). If a module can do both
read and write commands, it may be important that write transactions always succeed
(e.g. when writing to an interrupt controller), but that read transactions are not critical
because they can be retried (so the CMD of the read transaction is dropped and the
read never executed, or the RETDATA is dropped after the read has been executed).
Another example is that if you know that writes always succeed if they are delivered,
a flow-controlled connection is requested. Acknowledgements are not necessary in
that case. Without flow control acknowledgements are compulsory, complicating the
master and causing additional traffic.

In the integrated circuit according to the invention the decision whether or not to
drop messages is not made per transaction but for the outgoing and return parts of the
connection as a whole. For example all outgoing messages (having the format
read+address or write+address+data) may be guaranteed lossless, while for all return
messages (whether read data or write acknowledgements) packets may be dropped.



A connection could be opened as follows:
connid = open (
    nofc/fc,
    outgoing unordered/local/global,
    outgoing buffer size,
    return unordered/local/global,
    return buffer size);

i.e. all outgoing messages have certain properties, and all return messages have
certain properties.
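
The schematic call above can be turned into a small runnable sketch. It is an illustration only: the names Ordering, Direction, Connection and open_connection are hypothetical, and giving each direction its own flow-control flag is an assumption based on the remark that dropping is decided separately for the outgoing and return parts.

```python
import itertools
from dataclasses import dataclass
from enum import Enum


class Ordering(Enum):
    UNORDERED = "unordered"
    LOCAL = "local"
    GLOBAL = "global"


@dataclass(frozen=True)
class Direction:
    flow_control: bool  # True = fc (lossless), False = nofc (messages may be dropped)
    ordering: Ordering
    buffer_size: int    # buffer space reserved for this direction


@dataclass(frozen=True)
class Connection:
    connid: int
    outgoing: Direction  # properties shared by all outgoing messages
    ret: Direction       # properties shared by all return messages


_connids = itertools.count()  # hypothetical source of connection identifiers


def open_connection(outgoing: Direction, ret: Direction) -> Connection:
    """Open a connection whose two directions carry independent properties."""
    if outgoing.buffer_size <= 0 or ret.buffer_size <= 0:
        raise ValueError("buffer sizes must be positive")
    return Connection(next(_connids), outgoing, ret)
```

A read-dominated connection in the spirit of the claim 7 remark would then use a small outgoing buffer and a large return buffer, for example open_connection(Direction(True, Ordering.LOCAL, 4), Direction(False, Ordering.GLOBAL, 64)).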

With reference to claim 8 it is remarked that in a processing system with modules 
working asynchronously with respect to each other it is usual that a module receiving 
data issues an acknowledge signal to inform the issuing processor that it has received 
a message. In case that a message is multicast a plurality of said acknowledge signals 
is generated, which imposes a burden for the issuing processor. In the integrated 
circuit of the invention the first module receives only a single message, which reduces 
this burden. This measure is based on the insight that the network usually can 
relatively easily generate the single return message in response to the plurality of 
acknowledge messages of the second modules as a side effect of the functions already 
present in the network for other purposes.

With reference to claim 9: Depending on the situation the single return message can
depend on the acknowledge messages in various ways. The embodiment of claim 2 is
favorable where the addressed second modules are memories, and the first module 
attempts to store data therein. In that case it is sufficient that only one copy of the data 
is really received and stored. 

With reference to claim 10: In other situations it is compulsory that each of the
addressed second modules has received the data. In the embodiment of claim 10 the
single return message is not generated until this is the case.

Otherwise the return message could be combined as follows.

If the write transaction has been successfully executed by all slaves, all will
return RETSTAT=RETOK, which can be combined by
the ANIP in a single message to be delivered to the master.

If the write transaction has been successfully executed only by some slaves, there
will be a mix of RETSTATs (RETOK and RETERROR). They can either be
combined into

(a) a single RETSTAT=RETERROR, to specify that an error occurred, or

(b) a single RETSTAT, but a larger one, more descriptive, encoding
where there have been errors. All RETSTATs can be bundled together
in a single RETSTAT for the master, or <slave identifier, error code>
pairs can be bundled to form a single RETSTAT for the master.

If the connection has no flow control, messages can be dropped
at the PNIPs, resulting also in RETSTAT=RETLOST messages. Again, combinations
as those above can be made.
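
The two ways of combining the return messages of a multicast write can be sketched as below. This is a minimal illustration, not the ANIP's actual logic: the function names, the dictionary keyed by slave identifier, and the choice to report any non-RETOK status (including RETLOST) as RETERROR in option (a) are assumptions.

```python
from enum import Enum


class RetStat(Enum):
    RETOK = "RETOK"
    RETERROR = "RETERROR"
    RETLOST = "RETLOST"


def combine_simple(statuses: dict) -> RetStat:
    """Option (a): a single RETOK only if every slave returned RETOK."""
    if all(status is RetStat.RETOK for status in statuses.values()):
        return RetStat.RETOK
    return RetStat.RETERROR


def combine_detailed(statuses: dict) -> list:
    """Option (b): bundle <slave identifier, error code> pairs for the failing slaves."""
    failing = [(slave, status) for slave, status in statuses.items()
               if status is not RetStat.RETOK]
    return sorted(failing, key=lambda pair: pair[0])
```

In both cases the master receives one message per multicast transaction, so the cost of collecting the individual acknowledgements stays inside the network interface.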
With reference to claim 11: In this way it is guaranteed that the first module always
receives a response to a transaction, even if the connection has no flow control (i.e.
data may be dropped). This is done by only dropping data in the PNIP (the network
interface coupled to the second, receiving module), and returning a FAIL/ERROR to
the ANIP (the network interface coupled to the first module). This return status
(RETSTAT) message will never be dropped because the ANIP that initiated the
transaction will reserve space for return messages of every transaction that it initiates.
This combination of reserving space and generating an error message whenever a
message is dropped is a way to introduce flow control. Preferably the RETSTAT
message is generated by the interface of the receiving module, although alternatively
it could be generated at the intermediary network nodes too.

The method according to the invention guarantees transaction completion, i.e. it is
always known whether an initiated transaction

(a) was delivered and executed successfully at the slave (RETSTAT=OK produced by
the slave), or

(b) was never delivered at the slave (RETSTAT=REQLOST produced by the PNIP),
or

(c) was delivered at the slave, but not successfully executed (RETSTAT=ERROR
produced by the slave), or

(d) was delivered and executed successfully at the slave but the response message was
dropped (RETSTAT=RETLOST produced by the ANIP).

This is achieved by either
(i) not dropping messages (flow-controlled connection), in which case RETSTAT is
either OK or ERROR, or

(ii) allowing messages to be dropped (on a connection without flow control), but
generating a RETSTAT (REQLOST or RETLOST) whenever the message is dropped,
or a RETOK or RETERROR as usual when the message is not dropped.

It is essential, however, never to drop RETSTATs, because this completes the
transaction. This is realized in that a buffer for the RETSTAT is located at the master's
ANIP. The latter reserves space for RETSTATs when initiating transactions, and
bounds the number of outstanding transactions (for finite-sized RETSTAT buffers).
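
The reservation of RETSTAT space at the ANIP can be sketched as a simple counter. This is an illustration under assumptions: the class name RetstatBuffer and its methods are hypothetical, and the sketch models only the bookkeeping, not the message transport.

```python
class RetstatBuffer:
    """Tracks RETSTAT slots reserved at the master's ANIP."""

    def __init__(self, capacity: int):
        self._capacity = capacity  # finite number of RETSTAT entries
        self._reserved = 0         # entries held by outstanding transactions

    def try_initiate(self) -> bool:
        """Reserve one RETSTAT entry; refuse the transaction if none is free."""
        if self._reserved == self._capacity:
            return False
        self._reserved += 1
        return True

    def complete(self) -> None:
        """Release the entry once the RETSTAT has been handed to the master."""
        if self._reserved == 0:
            raise RuntimeError("no outstanding transaction to complete")
        self._reserved -= 1

    @property
    def outstanding(self) -> int:
        return self._reserved
```

Because an entry is reserved before the transaction leaves, an arriving RETSTAT always finds room, which is why it never has to be dropped; the price is that the buffer size bounds the number of outstanding transactions.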

The flow control on the outgoing and return connections is in principle independent.
Thus, for outgoing flow control & return flow control, the RETSTAT message is
according to a) or c) above.

In case of outgoing flow control & no return flow control, the RETSTAT message is
a) or c) or d) above.

In case of no outgoing flow control & return flow control, the RETSTAT message is
a) or b) or c) above.
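
The three combinations listed above follow one rule: REQLOST is possible only without outgoing flow control, and RETLOST only without return flow control. A minimal sketch of that rule is given below. The function name is hypothetical, and the result for the fourth combination (no flow control in either direction) is an extrapolation that the text does not state explicitly.

```python
def possible_retstats(outgoing_fc: bool, return_fc: bool) -> set:
    """Return the RETSTAT values a transaction can complete with."""
    outcomes = {"OK", "ERROR"}   # cases (a) and (c) are always possible
    if not outgoing_fc:
        outcomes.add("REQLOST")  # case (b): request dropped at the PNIP
    if not return_fc:
        outcomes.add("RETLOST")  # case (d): response dropped at the ANIP
    return outcomes
```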

Other embodiments are such an integrated circuit wherein the return message is a 
message indicating whether the second module has received a message from the first 
module. In this embodiment the return message can be very compact, e.g. one or two
bits to indicate one of the four options described above.
Alternatively or in addition a return message comprises an identification of the
message received by the second module. 

1. I suggest "efficiency" instead of "performance", because performance is just one of the
[...] performance (e.g., by adding more connections for the same resources).

2. Acknowledged write: write command + outgoing data use guaranteed throughput
(mode one in your example), and acknowledgment uses best effort (mode two in your example).
Moreover, except time-related guarantees, there is also a distinction on the buffering [...]
and acknowledgments. Consequently, for a read transaction (your example) buffers for the return [...]
buffers for the outgoing part are larger, and those for acknowledgments are smaller.
Page: 8 

3. It is indeed possible to allocate different bandwidths as you suggest. However, there are
limitations. We use a slot table, which contains a [...]
reserved, allocating these slots to connections. For example, if we use a table with 100 slots and a
[...] of 1 µs, each slot will be allocated for 1/100 of 1 µs = 10 ns. If the network provides 1 Gb/s [...],
the bandwidth per slot will be 1/100 of 1 Gb/s = 10 Mb/s. We can only allocate multiples of
10 Mb/s for guaranteed throughput traffic.

For [...] bursts, allocating the minimum bandwidth of 10 Mb/s would
probably be too much. The bandwidth can indeed be used by best-effort
traffic, however, not by other guaranteed throughput traffic. As a result, not all the traffic for
which guarantees are needed may fit in the slot table.

An alternative is to use more slots, but this increases the cost of the router. This is why a best-effort
command may be a better solution.
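
The slot-table arithmetic in this comment can be checked with a short sketch. The function names are hypothetical; the figures (100 slots, 1 Gb/s) are the ones used in the comment, while the 100 kb/s command stream in the usage note is an assumed value chosen only to show the granularity problem.

```python
def slot_bandwidth(link_bps: int, num_slots: int) -> int:
    """Bandwidth represented by one slot of the table, in bits per second."""
    return link_bps // num_slots


def slots_needed(required_bps: int, link_bps: int, num_slots: int) -> int:
    """Smallest number of slots whose combined bandwidth covers required_bps."""
    per_slot = slot_bandwidth(link_bps, num_slots)
    return -(-required_bps // per_slot)  # ceiling division on integers


def reserved_but_unused(required_bps: int, link_bps: int, num_slots: int) -> int:
    """Guaranteed bandwidth reserved beyond what the connection actually needs."""
    per_slot = slot_bandwidth(link_bps, num_slots)
    return slots_needed(required_bps, link_bps, num_slots) * per_slot - required_bps
```

With a 100-slot table on a 1 Gb/s link, slot_bandwidth(10**9, 100) gives 10 Mb/s. A read-command stream needing only 100 kb/s would still occupy a full slot, leaving 9.9 Mb/s that other guaranteed connections cannot claim, which is the motivation for sending such commands as best-effort traffic.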
Page: 8

4. This definition is good for outgoing messages, as there is one source (ANIP) and potentially
multiple destinations (PNIPs). However, for return messages, we define global/local ordering as
follows. Global ordering means that responses from all PNIPs/slaves (i.e. sources of messages in this
case) come in the same order as the transactions have been initiated (i.e., the same order as the
commands have been issued by the master to the ANIP). Local ordering guarantees the order of
responses only if they come from the same slave/PNIP.
Page: 8

5. Slave modules

Page: 8

6. We can only guarantee the order in which we offer transactions to the slave module, but the order of
processing depends on the module implementation. It can well decide to process transactions in a
different order (e.g., memory controller). For ordering we only require that the responses are returned in
the same order as the transactions were accepted.
Page: 8

7. This is only valid for global ordering. For local ordering (i.e., order preserved only per slave),
if ordered transport channels are used, no sorting is necessary.

Page: 8

8. Global ordering of responses conforms with AMBA. Local ordering of responses does not.
Page: 8

9. I think Kees meant write transactions may be critical and we don't want to lose them, but
read transactions can be lost, because they can be tried later. See example below in the text.
Page: 8

10. The two commands (i.e., read and write) can indeed be sent from the same module. If we set
up a connection with flow control for the outgoing part both commands will be delivered. However, if
the return part has no flow control, the responses for read commands may be lost. In such a case the
read transactions will fail. I think Kees meant read transactions being lost, not read commands being
lost.
Page: 8

11. fc = flow control, nofc = no flow control
Page: 8

12. Buffer is reserved only for a return status message, such as an acknowledgment, or an error
message. Buffer can be, but is not necessarily, reserved also for returned data.
Page: 8

13. Data can also be dropped at the ANIP (RETDATA) when no flow control is implemented
for the return part. In such a case, a RETSTAT=RETLOST will replace the RETSTAT=RETOK
which accompanied the dropped RETDATA.

Page: 8

14. Has reserved
Page: 8

15. Yes, this is true. Between routers, there is always link-level flow control and no data is ever
lost. Data can be lost only in the network interfaces, if no end-to-end flow control (here referred to
simply as flow control) is implemented. Therefore, here, messages reach the PNIP even when no
(end-to-end) flow control is implemented.



These and other aspects are described in more detail in the following three annexes:

1. Communication Services for Networks on Chip, pages 1-25, by Andrei
Rădulescu and Kees Goossens;

Further background information useful for implementing the invention can be found
at:

2. Networks on Silicon: Blessing or Nightmare?, pp 1-5, by Paul Wielage and
Kees Goossens (published), and

3. Trade-Offs in the Design of a Router with Combined Guaranteed and Best-Effort
Services for Networks on Chip, pp 1-6, by Edwin Rijpkema, Kees Goossens,
Andrei Rădulescu, Jef van Meerbergen, and Paul Wielage, submitted to and rejected
by ISSS 2002.
Introduction

Networks on chip (NoC) have received considerable attention recently
as a solution to the interconnect problem in highly-complex chips [3-5, 7-9,
15, 19, 22]. The reason is twofold. First, NoCs help resolve the electrical
problems in new deep-submicron technologies, as they structure and
manage global wires [3-5, 7, 8]. At the same time they share wires, lowering
their number and increasing their utilization [7, 8]. NoCs can also be
energy efficient and reliable [4], and are scalable compared to buses [9].
Second, NoCs also decouple computation from communication, which is
essential in managing the design of billion-transistor chips [14, 22]. NoCs
achieve this decoupling because they are traditionally designed using
protocol stacks [21], which provide well-defined interfaces separating
communication service usage from service implementation [5, 22].

Using networks for on-chip communication when designing systems on
chip (SoC), however, raises a number of new issues that must be taken
into account. This is because, in contrast to existing on-chip interconnects
(e.g., buses, switches, or point-to-point wires), where the communicating
modules are directly connected, in a NoC the modules communicate
remotely via network nodes. As a result, interconnect arbitration changes
from centralized to distributed, and issues like out-of-order transactions,
higher latencies, and end-to-end flow control must be handled either by
the intellectual property block (IP) or by the network itself.

Most of these topics have already been the subject of research in the field
of computer networks [24] and parallel machine interconnect networks [6].
However, on-chip networks have different properties (e.g., tighter link
synchronization) and constraints (e.g., higher memory cost) leading to
different design choices, which in the end affect the network services.

In this paper, we compare NoCs and off-chip networks, showing both
their similarities and differences. We also explore the differences between
NoCs and existing on-chip interconnects. We list new issues that must be
resolved in system design due to the multi-hop nature of NoCs, and present
an interface which takes these issues into consideration. Our interface is
aimed at being similar to a split-transaction bus interface, such as VCI [25]
or OCP [17], to allow simple, low-cost wrappers to bus interfaces, and
to allow backward compatibility with existing IPs. Our interface uses a
request-response protocol that provides basic read and write operations.
But our interface extends bus interfaces to fully exploit the power of our
NoC [8, 19, 20]. For example, it offers connection-based communication
where end-to-end flow control and time-related guarantees (e.g., bounded
latency) can be requested.

The paper is organized as follows. In the next two sections we compare
NoC properties with those of off-chip networks and buses, respectively.
In Section IV, we present the services we offer in our network. Finally, we
present our conclusions.



II. Networks Brought on Chip

Networks have been the subject of research for decades, both in the
context of local and wide area networks (computer networks) [24], and as
an interconnect for parallel machines [6]. Both are very much related to
on-chip networks, and many of the results in those fields are also applicable
on chip. However, NoC premises are different from those of off-chip networks,
and, therefore, most of the network design choices must be reevaluated.

NoCs differ from off-chip networks mainly in their constraints and
synchronization. Typically, most on-chip resources have much tighter
constraints compared to off-chip. Storage (i.e., memory) and computation
resources are relatively more expensive, whereas the number of
point-to-point links is larger on chip than off chip [7].

Storage is expensive, because general-purpose on-chip memory, such as
RAM, occupies a large area. Having the memory distributed in the network
components in relatively small sizes is even worse, as the overhead area
in the memory then becomes dominant.

Also computation for on-chip networks comes at a relatively high cost
compared to off-chip networks. An off-chip network interface usually
contains a dedicated processor to implement the protocol stack up to the
network layer or even higher, to off-load the host processor from the
communication processing. Including a dedicated processor in a network
interface is not feasible on chip, as the size of the network interface will
become comparable to or larger than the IP to be connected to the network.
Moreover, running the protocol stack on the IP itself may also not be feasible,
because often these IPs have one dedicated function only, and do not have
the capabilities to run a network protocol stack.

The number of wires and pins to connect network components is an
order of magnitude larger on chip than off chip [7]. If they are not used
massively for other purposes than NoC, they allow wide point-to-point
interconnects (e.g., 300-bit links) [7, 15]. This is not possible off-chip, where
links are relatively narrower: 8-16 bits.

On-chip wires are also relatively short, allowing a much tighter
synchronization than off chip. This allows a reduction in the buffer space in
the routers because the communication can be done at a smaller
granularity. In the current semiconductor technologies, wires are also fast and
reliable, which allows simpler link-layer protocols (e.g., no need for
error correction or retransmission). This also compensates for the lack of
memory and computational resources.

In the rest of the section, we list five network issues that have a direct
impact on the NoC cost: reliable communication, deadlock, data ordering,
network flow control and buffering strategy, and time-related guarantees.
For each of them, we discuss the differences and similarities for on- and
off-chip networks.

Reliable communication. A consequence of the tight on-chip resource
constraints is that the network components (i.e., routers and network
interfaces) must be fairly simple to minimize computation and memory
requirements. Luckily, on-chip wires provide a reliable communication
medium, which avoids the considerable overhead incurred by the off-chip
networks for providing reliable communication. Data integrity can be
provided at low cost at the data link layer. However, data loss also depends
on the network architecture, as in most computer networks data is simply
dropped if congestion occurs in the network [6, 24]. On-chip, dropping
data may lead to a too costly implementation of reliable communication.
We show below that a network where no data is dropped can lead to a much
lower-cost solution, at the peril of introducing the possibility of deadlock.

Deadlock. Computer network topologies have generally an irregular
(possibly dynamic) structure and bidirectional links, which can introduce
buffer cycles. In such topologies, packet dropping at the network nodes
may be required to avoid deadlocks.

Deadlock can also be avoided without dropping data, for example, by
introducing constraints either in the topology or routing. Fat-tree
topologies have already been considered for NoCs, where deadlock is avoided by
bouncing back packets in the network in case of overflow [9]. Tile-based
approaches to system design [7, 15, 23] use mesh or torus network
topologies, where deadlock can be avoided using, for example, a turn-model
routing algorithm [6].

An alternative solution for deadlock in NoCs, which takes into
consideration that modules connecting to the network are either masters (initiating
requests and receiving responses), or slaves (receiving requests and
sending back responses), is to maintain separate virtual networks (with separate
buffers) for requests and responses [6].

Data ordering. In a network, data sent from a source to a destination
may arrive out of order due to reordering in network nodes, following
different routes, or retransmission after dropping. For off-chip networks
out-of-order data delivery is typical. However, for NoCs where no data is
dropped, data can be forced to follow the same path between a source and
a destination (deterministic routing) with no reordering. This in-order data
transportation requires less buffer space, and reordering modules are no
longer necessary.

Network flow control and buffering strategy. Network flow control
and buffering strategy have a direct impact on the memory utilization
in the network. Wormhole routing requires only a flit buffer in the
router, whereas store-and-forward and virtual-cut-through routing require
at least the buffer space to accommodate a packet. Consequently, on chip,
wormhole routing may be preferred over virtual-cut-through or
store-and-forward routing. Similarly, input queuing may be a lower memory-cost
alternative to virtual-output-queuing or output-queuing buffering strategies,
because it has fewer queues. Dedicated FIFO memory structures at a low cost
also enable on-chip usage of virtual-cut-through routing or virtual output
queuing for a better performance [19]. However, using virtual-cut-through
routing and virtual output queuing at the same time is still too costly [19].




Figure 1. A network interconnect example.

Figure 2. A bus interconnect example.



Time-related guarantees. Off-chip networks typically use packet
switching and offer best-effort services. Contention can occur at each
network node, making latency guarantees very hard to offer. Throughput
guarantees can still be offered using schemes such as rate-based switching [26]
or deadline-based packet switching [18], but with high buffering costs.

An alternative to provide such time-related guarantees is to use
time-division multiple access (TDMA) circuits, where every circuit is dedicated
to a network connection. Circuits provide guarantees at a relatively low
memory and computation cost. Network resource utilization is increased
when the network architecture allows any left-over guaranteed bandwidth
to be used by best-effort communication [10, 19, 20].
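
The idea of letting best-effort traffic use left-over guaranteed bandwidth can be sketched as a per-slot arbitration step. This is an illustrative sketch only, not the router design of [19]: the slot table layout (a list holding a connection name or None per slot), the queue structures, and the function name schedule are assumptions.

```python
from collections import deque
from typing import Optional


def schedule(slot_table: list, cycle: int, gt_queues: dict, be_queue: deque) -> Optional[str]:
    """Pick the packet to send in this time slot, or None if nothing is waiting.

    A slot belongs to the guaranteed connection named in slot_table (None means
    the slot is unreserved). If the owner has nothing to send, or the slot is
    unreserved, the slot is handed to best-effort traffic.
    """
    owner = slot_table[cycle % len(slot_table)]
    if owner is not None and gt_queues.get(owner):
        return gt_queues[owner].popleft()
    if be_queue:
        return be_queue.popleft()
    return None
```

A guaranteed connection is never delayed by best-effort packets in its own slots, so its bound holds, while slots it leaves idle are not wasted.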



III. From Buses to NoCs

Introducing networks (Figure 1) as on-chip interconnects radically
changes the communication when compared to direct interconnects, such
as buses or switches (Figure 2). This is because of the multi-hop nature
of a network, where communication modules are not directly connected,
but separated by one or more network nodes. This is in contrast with the
prevalent existing interconnects (i.e., buses) where modules are directly
connected. The implications of this change reside in the arbitration (which
must change from centralized to distributed), and in the communication
properties (e.g., ordering, or flow control).
In this section, we list some of these topics, and outline the differences
of NoCs and buses. We refer mainly to buses as direct interconnects,
because currently they are the most used on-chip interconnect. Most
of the bus characteristics also hold for other direct interconnects (e.g.,
switches [16]). Multilevel buses are a hybrid between buses and NoCs.
Depending on the functionality of the bridges, for our purposes, multilevel
buses either behave like simple buses [2] or like NoCs.



Programming Model. The programming model of a bus typically
consists of load and store operations which are implemented as a
sequence of primitive bus transactions. Bus interfaces typically have
dedicated groups of wires for command, address, write data, and read data
[1, 12, 13, 17, 25].

A bus is a resource shared by multiple IPs. Therefore, before using it, 
IPs must go through an arbitration phase, where they request access to the 
bus, and block until the bus is granted to them.

A bus transaction involves a request and possibly a response. Modules 
issuing requests are called masters, and those serving requests are called 
slaves. If there is a single arbitration for a pair of request-response, the 
bus is called non-split. In this case, the bus remains allocated to the master
of the transaction until the response is delivered, even when this takes a 
long time. Alternatively, in a split bus, the bus is released after the request 
to allow transactions from different masters to be initiated. However, a 
new arbitration must be performed for the response such that the slave can 
access the bus [11]. 

For both split and non-split buses, both communication parties have
direct and immediate access to the status of the transaction. In contrast,
network transactions are one-way transfers from an output buffer at the
source to an input buffer at the destination that causes some action at the
destination, the occurrence of which is not visible at the source [6]. The
effects of a network transaction are observable only through additional
transactions. A request-response type of operation is still possible, but
requires at least two distinct network transactions. Thus, a bus-like
transaction in a NoC will essentially be a split transaction.

Transaction Ordering. Traditionally, on a bus all transactions are
ordered (cf. Peripheral VCI [25], AMBA [1], or CoreConnect PLB and
OPB [12, 13]). This is possible at a low cost, because the interconnect,
being a direct link between the communicating parties, does not reorder
data. However, on a split bus, a total ordering of transactions on a single
master may still cause performance penalties, when slaves respond at
different speeds. To solve this problem, recent extensions to bus protocols
allow transactions to be performed on connections. Ordering of
transactions within a connection is still preserved, but between connections
there are no ordering constraints (e.g., OCP [17], or Basic VCI [25]). A few
of the bus protocols allow out-of-order responses per connection in their
advanced modes (e.g., Advanced VCI [25]), but both requests and responses
arrive at the destination in the same order as they were sent.

In a NoC, ordering becomes weaker. Global ordering can only be provided
at a very high cost due to the conflict between the distributed nature
of the networks, and the requirement of a centralized arbitration necessary
for global ordering.

Even local ordering, between a source-destination pair, may be costly. 
Data may arrive out of order if it is transported over multiple routes. In 
such cases, to still achieve an in-order delivery, data must be labeled with 
sequence numbers and reordered at the destination before being delivered. 
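
The reordering step described above can be sketched as follows. This is only
an illustration under simple assumptions (sequence numbers start at 0 and
never wrap); the class name ReorderBuffer and its method are hypothetical and
do not come from the paper.

```python
class ReorderBuffer:
    """Hypothetical destination-side buffer that restores in-order delivery."""

    def __init__(self):
        self.next_seq = 0   # sequence number expected next (assumed to start at 0)
        self.pending = {}   # data that arrived early, keyed by sequence number

    def receive(self, seq, data):
        """Store one arrival and return the data that can now be delivered in order."""
        self.pending[seq] = data
        delivered = []
        # Release every item whose predecessors have all been delivered.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

Data that arrives ahead of its turn stays in the buffer, which is where the
extra cost of local ordering over multiple routes comes from.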

Atomic Chains of Transactions. An atomic chain of transactions is 
a sequence of transactions initiated by a single master that is executed on 
a single slave exclusively. That is, other masters are denied access to that 
slave, once the first transaction in the chain claimed it. This mechanism is 
widely used to implement synchronization mechanisms between master
modules (e.g., semaphores). 

On a bus, atomic operations can easily be implemented, as the central 
arbiter will either (a) lock the bus for exclusive use by the master request- 
ing the atomic chain, or (b) know not to grant access to a locked slave. 
In the former case, the time resources are locked is shorter because once
a master has been granted access to a bus, it can quickly perform all the
transactions in the chain (no arbitration delay is required for the subsequent
transactions in the chain). Consequently, the locked slave and the bus can
be opened up again in a short time. This approach is used in AMBA and
CoreConnect. In the latter case, the bus is not locked, and can still be used
by other modules, however, at the price of a longer locking time of the
slave. This approach is used in VCI and OCP.

In a NoC, where the arbitration is distributed, masters do not know that
a slave is locked. Therefore, transactions to a locked slave may still be
initiated, even though the locked slave cannot accept them. Consequently,
to prevent deadlock, these other transactions must be either dropped, or
stored such that transactions in the atomic chain can be filtered and still be
served. Moreover, the time a module is locked is much longer in case of
NoCs, because of the higher latency per transaction.

Deadlock. In buses, deadlocks are generally not an issue. Deadlock
can still occur at the application level (e.g., an atomic chain of
transactions that locks the bus and is never finished), but it is not caused
by the interconnect itself.

In a network, deadlock becomes a more important issue, and special
care has to be taken in the network design to avoid deadlock. Deadlock is
mainly caused by cycles in the buffers. To avoid deadlock, either network
nodes must drop packets when their buffers are filled, or routing must be
cycle-free. In a NoC, we believe the latter is preferable, because of its
lower cost in achieving reliable communication (see Section II).

A second cause of deadlock is atomic chains of transactions. The reason
is that while a module is locked, the queues storing transactions may
get filled with transactions outside the atomic transaction chain, blocking
the access of the transactions in the chain to the locked module. If
atomic transaction chains must be implemented (to be compatible with
processors allowing this, such as MIPS), the network nodes should be able
to filter the transactions in the atomic chain, or be allowed to drop those
blocking them.

Media Arbitration. An important difference between buses and
NoCs is in the media arbitration scheme. In a bus, master modules
request access to the interconnect, and the arbiter grants the access for the
whole interconnect at once. Arbitration is centralized as there is only one
arbiter component, and global as all the requests as well as the state of the
interconnect are visible to the arbiter. Moreover, when a grant is given, the
complete path from the source to the destination is exclusively reserved.

In a non-split bus, arbitration takes place once when a transaction is
initiated. As a result, the bus is granted for both request and response. In a
split bus, requests and responses are arbitrated separately.

In a NoC arbitration is also necessary, as it is a shared interconnect.
However, in contrast to buses, the arbitration is distributed, because it is
performed in every router, and is based only on local information.
Arbitration of the communication resources (links, buffers) is performed
incrementally as the request or response advances [19].



Destination Name and Routing. For a bus, the command, address,
and data are broadcast on the interconnect. They arrive at every
destination, of which one activates based on the broadcast address, and
executes the requested command. This is possible because all modules are
directly connected to the same bus.

In a NoC, it is not feasible to broadcast information to all destinations,
because it must be copied to all routers and network interfaces. This floods
the network with data. The address is better decoded at the source to find a
route to the destination module. A transaction address will therefore have
two parts: (a) a destination identifier, and (b) an internal address at the


Latency. Transaction latency is caused by two factors: (a) the access
time to the bus, which is the time until the bus is granted, and (b) the
latency introduced by the interconnect to transfer the data.

For a bus, where the arbitration is centralized, the access time is
proportional to the number of masters connected to the bus. The transfer
latency itself typically is constant and relatively fast, because the modules
are linked directly. However, the speed of transfer is limited by the bus
speed, which is relatively low.

In a NoC, arbitration is performed at each router for the following link.
The access time per router is small. Both end-to-end access time and
transport time increase proportionally to the number of hops between master
and slave. However, network links are unidirectional and point to point,
and hence can run at higher frequencies than buses, thus lowering the
latency.

From a latency perspective, using a bus or a network is a trade-off
between the number of modules connected to the interconnect (which affects
access time), the speed of the interconnect, and the network diameter.

Data Format. In most modern bus interfaces the data format is
defined by separate wire groups for the transaction type, address, write data,
read data, and return acknowledgments/errors (e.g., VCI, OCP, AMBA, or
CoreConnect). This is used to pipeline transactions. For example,
concurrently with sending the address of a read transaction, the data of a
previous write transaction can be sent, and the data from an even earlier
read transaction can be received. Moreover, having dedicated wire groups
simplifies the transaction decoding; there is no need for a mechanism to
select between different kinds of data sent over a common set of wires.

Inside a network, there is typically no distinction between different
kinds of data. Data is treated uniformly, and passed from one router to
another. This is done to minimise the control overhead and buffering in
routers. If separate wires were used for each of the above-mentioned
groups, separate routing, scheduling, and queuing would be needed,
increasing the cost of routers.

In addition, in a network at each layer in the protocol stack, control
information must be supplied together with the data (e.g., command type,
address, or data size). This control information is organized as an envelope 
around the data. That is, first a header is sent, followed by the actual data 
(payload), followed possibly by a trailer. Multiple such envelopes may be 
provided for the same data, each carrying the corresponding control infor- 
mation for each layer in the network protocol stack [6, 24]. 
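
The nesting of envelopes can be illustrated with a short sketch. The byte
strings used as headers and trailers, and the function names wrap and
encapsulate, are hypothetical; the paper does not define a concrete format.

```python
def wrap(payload: bytes, header: bytes, trailer: bytes = b"") -> bytes:
    """Build one envelope: header first, then the payload, then an optional trailer."""
    return header + payload + trailer


def encapsulate(data: bytes, layers) -> bytes:
    """Wrap data once per protocol layer.

    `layers` is a list of (header, trailer) pairs ordered from the highest
    layer to the lowest, so the lowest layer ends up as the outermost envelope.
    """
    for header, trailer in layers:
        data = wrap(data, header, trailer)
    return data
```

For example, encapsulate(b"DATA", [(b"T|", b""), (b"N|", b"|n")]) places a
made-up transport header directly around the data and a made-up network
header and trailer around the result.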

Buffering and Flow Control. Buffering data of a master (output
buffering) is used both for buses and NoCs to decouple computation from
communication. However, for NoCs output buffering is also needed to
marshal data, which consists of (a) (optionally) splitting the outgoing data
in smaller packets which are transported by the network, and (b) adding
control information for the network around the data (packet header). To
avoid output buffer overflow the master must not initiate transactions that
generate more data than the currently available space.
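
A minimal sketch of this marshaling step is given below. It assumes that
buffer space is counted in words, that every packet header occupies one word,
and that a header holds only the destination and the payload length; these
choices, and the name OutputBuffer, are illustrative assumptions.

```python
from collections import deque


class OutputBuffer:
    """Hypothetical output buffer that packetizes outgoing data."""

    def __init__(self, capacity_words, max_payload):
        self.capacity_words = capacity_words
        self.max_payload = max_payload    # largest payload per packet, in words
        self.used_words = 0
        self.packets = deque()            # packets waiting to enter the network

    def submit(self, dest, data):
        """Packetize `data` for `dest`; refuse it if the buffer would overflow."""
        n_packets = -(-len(data) // self.max_payload)   # ceiling division
        needed = len(data) + n_packets                  # one header word per packet (assumed)
        if needed > self.capacity_words - self.used_words:
            return False                                # the master must not start this transaction
        for start in range(0, len(data), self.max_payload):
            chunk = data[start:start + self.max_payload]
            header = (dest, len(chunk))                 # control information around the data
            self.packets.append((header, chunk))
        self.used_words += needed
        return True
```

The refusal path corresponds to the rule that a master may only initiate
transactions whose data fits in the currently available space.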

Similarly to output buffering, input buffering is also used to decouple
computation from communication. In a NoC, input buffering is also
required to unmarshal data.

In addition, flow control for input buffers differs for buses and NoCs.
For buses, the source and destination are directly linked, and the destination
can therefore signal directly to a source that it cannot accept data. This
information can even be available to the arbiter, such that the bus is not
granted to a transaction trying to write to a full buffer.

In a NoC, however, the destination of a transaction cannot signal
directly to a source that its input buffer is full. Consequently, transactions
to a destination can be started, possibly from multiple sources, after the
destination's input buffer has filled up. Two policies can be adopted when
an input buffer is full. The first is not to accept additional incoming
transactions, and to store them in the network. However, this approach can
easily lead to network congestion, as the data could be eventually stored all
the way to the sources, blocking the links in between. The second approach is
to accept incoming transactions at a full destination, and drop some data
in the input buffer. Congestion is avoided but data is lost, which is
undesirable.

To avoid input buffer overflow, connections can be used, together with
end-to-end flow control. At connection set up between a master and one
or more slaves, buffer space is allocated at the network interfaces of the
slaves, and the network interface of the master is assigned credits reflecting
the amount of buffer space at the slaves. The master can only send data
when it has enough credits for the destination slave(s). The slaves grant
credits to the master when they consume data.
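
The credit mechanism can be sketched from the sender's point of view as
follows. The class name CreditSender and the choice of counting credits in
words are assumptions made for illustration only.

```python
class CreditSender:
    """Hypothetical master-side credit counter for one destination slave."""

    def __init__(self, credits):
        # Initial credits reflect the buffer space allocated at the slave's
        # network interface when the connection was set up.
        self.credits = credits

    def try_send(self, n_words):
        """Send only if enough credits remain; otherwise the sender must wait."""
        if n_words > self.credits:
            return False
        self.credits -= n_words
        return True

    def grant(self, n_words):
        """Called when the slave has consumed data and returns credits."""
        self.credits += n_words
```

Because the sender never holds more credits than there is free space at the
receiver, data sent under this scheme always finds room in the input buffer.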



IV. The Æthereal Approach

As described in the previous two sections, NoCs have different
properties from both existing off-chip networks and existing on-chip
interconnects. As a result, existing protocols and service interfaces cannot be
adopted directly to NoCs, but must take the characteristics of NoCs into
account. For example, a protocol such as TCP/IP assumes the network is
lossy, and includes significant complexity to provide reliable
communication. Therefore, it is not suitable in a NoC where we assume data
transfer reliability is already solved at a lower level. On the other hand,
existing on-chip protocols such as VCI, OCP, AMBA, or CoreConnect are also
not directly applicable. For example, they assume ordered transport of data:
if two requests are initiated from the same master, they will arrive in the
same order at the destination. This does not hold automatically for NoCs.
Atomic chains of transactions and end-to-end flow control also need
special attention in a NoC interface.

Our objectives when defining our network services are the following.
First, the services abstract from the network internals as much as possible.
This is a key ingredient in tackling the challenge of decoupling the
computation from communication [14, 22], which allows IPs (the computation
part), and the interconnect (the communication part) to be designed
independently from each other. As a consequence, our services are positioned
at the transport layer in the ISO-OSI reference model [24], which is the
first layer to be independent of the implementation of the network.

Second, we aim at a NoC interface as close as possible to a bus
interface. NoCs can then be introduced non-disruptively: with minor changes,
existing IPs, methodologies and tools can continue to be used. As a
consequence, we use a request-response interface, similar to interfaces for
split buses [1, 12, 13, 17, 25].

Third, our interface extends traditional bus interfaces to fully exploit
the power of NoCs. For example, we offer connection-based communication
which does not only relax ordering constraints (as for buses), but also
enables new communication properties, such as end-to-end flow control
based on credits, or guaranteed throughput [8, 19, 20]. All these properties
can be set for each connection individually.



A. The Æthereal Connection and Transaction Model

IPs interact with our network [8, 19, 20] at so-called network interfaces
(NI). NIs provide NI ports (NIP) through which the communication services
are accessed. As shown in Figure 3, an NI can have several NIPs to which one
or more IPs (computation elements or memories, but not interconnection
elements) can be connected. Similarly, an IP can be connected to more than
one NI and NIP.

Figure 3. Examples of links between NIs and IPs.



Communication between NIPs is performed on connections. Connections
are introduced to describe and identify communication with different
properties, such as guaranteed throughput, bounded latency and jitter,
ordered delivery, or flow control. For example, to distinguish and
independently guarantee communication of 1 Mb/s and 25 Mb/s, two connections
can be used. Two NIPs can be connected by multiple connections, possibly
with different properties. Connections as defined here are similar to the
concept of threads and connections from OCP and VCI. Where in OCP and
VCI connections are used only to relax transaction ordering, we generalize
from only the ordering property to include configuration of buffering and
flow control, guaranteed throughput, and bounded latency per connection.

Æthereal connections must be created with the desired properties before
being used. This may result in resource reservations inside the network
(e.g., buffer space, or percentage of the link usage per time unit). If the
requested resources are not available, the network will refuse the request.
After usage, connections are closed, which leads to freeing the resources
occupied by that connection.
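
The open/refuse/close life cycle can be sketched as a simple resource
bookkeeping exercise. The resource names (buffer_words, slots), the class
name Network, and the use of integer connection identifiers are hypothetical;
the paper does not prescribe this interface.

```python
class Network:
    """Hypothetical bookkeeping of reservable network resources."""

    def __init__(self, buffer_words, slots):
        self.free = {"buffer_words": buffer_words, "slots": slots}
        self.connections = {}
        self._next_id = 0

    def open(self, **request):
        """Reserve the requested resources, or refuse by returning None."""
        if any(self.free.get(name, 0) < amount for name, amount in request.items()):
            return None                      # requested resources are not available
        for name, amount in request.items():
            self.free[name] -= amount
        conn_id = self._next_id
        self._next_id += 1
        self.connections[conn_id] = request
        return conn_id

    def close(self, conn_id):
        """Free the resources occupied by the connection."""
        for name, amount in self.connections.pop(conn_id).items():
            self.free[name] += amount
```

A request is either granted in full or refused; nothing is reserved for a
refused request, so a later request can still succeed once another
connection has been closed.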

To allow more flexibility in configuring connections, and, hence, better
resource allocation per connection, the outgoing and return parts of
connections are configured separately. For example, different buffer space can
be allocated in the ANIP and PNIPs, respectively, or different bandwidths
can be reserved for requests and responses.

Depending on the requested services, the time to handle a connection
(i.e., creating, closing, modifying services) can be short (e.g.,
creating/closing an unordered, lossy, best-effort connection) or significant
(e.g., creating/closing a multicast guaranteed-throughput connection).
Consequently, connections are assumed to be created, closed, or modified
infrequently, coinciding e.g., with reconfiguration points, when the
application requirements change.

Communication takes place on connections using transactions, each
consisting of a request and possibly a response. The request encodes an
operation (e.g., read, write, flush, test and set, nop) and possibly carries
outgoing data (e.g., for write commands). The response returns data as a
result of a command (e.g., read) and/or an acknowledgment.

A connection involves at least two NIPs. Transactions on a connection
are always started at one and only one of the NIPs, called the connection's
active NIP (ANIP). All the other NIPs of the connection are called passive
NIPs (PNIP).

There can be multiple transactions active on a connection at a time (as
for split buses). That is, transactions can be started at the ANIP of a
connection while responses for earlier transactions are pending. If a
connection has multiple slaves, multiple transactions can be initiated towards
different slaves. Transactions are also pipelined between a single pair of a
master and a slave for both requests and responses. In principle, transactions
can also be pipelined within a slave, if the slave allows this.

A transaction is composed of the following messages (see Figure 4):

• A command message (CMD) is sent by the ANIP, and describes the
action to be executed at the slave connected to the PNIP. Examples
of commands are read, write, test and set, and flush. Commands
are the only messages that are compulsory in a transaction. For
NIPs that allow only a single command with no parameters (e.g.,
fixed-size address-less write), we assume the command message
still exists, even if it is implicit (i.e., not explicitly sent by the IP).

• An out data message (OUTDATA) is sent by the ANIP following a
command that requires data to be executed (e.g., write, multicast,
and test-and-set).

• A return data message (RETDATA) is sent by a PNIP as a
consequence of a transaction execution that produces data (e.g., read,
and test-and-set).

• A completion acknowledgment message (RETSTAT) is an optional
message which is returned by the PNIP when a command has been
completed. It may signal either a successful completion or an
error. For transactions including both RETDATA and RETSTAT the
two messages can be combined in a single message for efficiency.
However, conceptually, they both exist: RETSTAT to signal the
presence of data or an error, and RETDATA to carry the data. In
bus-based interfaces RETDATA and RETSTAT typically exist as two
separate signals [1, 12, 13, 17, 25].

Figure 5. Transaction examples.

Messages composing a transaction are divided in outgoing messages,
namely CMD and OUTDATA, and response messages, namely RETDATA and
RETSTAT. Within a transaction, CMD precedes all other messages, and
RETDATA precedes RETSTAT if present. These rules apply both between
master and ANIP, and PNIP and slave. Examples of transactions are shown
in Figure 5.
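
The two ordering rules within a transaction can be expressed as a small
check. The string names for the message types and the function name
is_valid_transaction are illustrative; the additional requirement that a
transaction contains exactly one CMD is an assumption of this sketch.

```python
OUTGOING = ("CMD", "OUTDATA")
RESPONSE = ("RETDATA", "RETSTAT")


def is_valid_transaction(messages):
    """Check the message order of one transaction, given as a list of type names."""
    if any(m not in OUTGOING + RESPONSE for m in messages):
        return False
    # CMD is compulsory and precedes all other messages
    # (exactly one CMD per transaction is assumed here).
    if not messages or messages[0] != "CMD" or messages.count("CMD") != 1:
        return False
    # RETDATA precedes RETSTAT when both are present.
    if "RETDATA" in messages and "RETSTAT" in messages:
        return messages.index("RETDATA") < messages.index("RETSTAT")
    return True
```

An acknowledged write would be ["CMD", "OUTDATA", "RETSTAT"] and a read
would be ["CMD", "RETDATA", "RETSTAT"] in this representation.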

We classify connections as follows (see Figure 6): 

• A simple connection is a connection between one ANIP and one
PNIP.

• A narrowcast connection is a connection between one ANIP and
one or more PNIPs, in which the ANIP initiates transactions that
are executed by exactly one PNIP. An example of the narrowcast
connection is shown in Figure 7, where the ANIP performs
transactions on an address space which is mapped on two memory
modules. Depending on the transaction address, a transaction
is executed on only one of these two memories.

• A multicast connection is a connection between one ANIP and
one or more PNIPs, in which the sent messages are duplicated and
each PNIP receives a copy of those messages. In a multicast
connection no return messages are currently allowed, because of the
large traffic they generate (i.e., one response per destination). It
could also increase the complexity in the ANIP because individual
responses from PNIPs must be merged into a single response for
the ANIP. This requires buffer space and/or additional computation
for the merging itself.

Figure 6. Connection types.
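
The address-based selection in a narrowcast connection can be sketched as a
range lookup. The address map layout (base, size, PNIP name), the example
ranges, and the function name narrowcast_target are assumptions made for
illustration; the paper does not specify how the mapping is stored.

```python
def narrowcast_target(address, address_map):
    """Return (pnip, internal_address) for the single PNIP that serves `address`.

    `address_map` is a list of (base, size, pnip) entries with
    non-overlapping ranges.
    """
    for base, size, pnip in address_map:
        if base <= address < base + size:
            return pnip, address - base
    raise ValueError("address is not mapped to any PNIP of this connection")
```

With a map such as [(0x0000, 0x1000, "pnip0"), (0x1000, 0x1000, "pnip1")],
every transaction is executed on exactly one of the two memories, selected by
its address.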



B. Connection Properties

In this section we describe the properties that can be configured for
a connection: guaranteed message integrity, guaranteed transaction
completion, various transaction orderings, guaranteed throughput, bounded
latency and jitter, and connection flow control.

Data Integrity. Data integrity means that the payload of the message
is not changed (accidentally or not) during transport. We assume that data
integrity is already solved at a lower layer in our network, namely at the
link layer, because in current on-chip technologies data can be transported
uncorrupted over links. Consequently, our network interface always
guarantees that messages are delivered uncorrupted at the destination.

Transaction Completion. A transaction without a response is said to
be complete when it has been executed by the slave. As there is no response
message to the master, no guarantee regarding transaction completion can
be given.

Figure 8. Message ordering is observable at a, b, c, and d.

A transaction with a response is said to be complete when a RETSTAT
message is received from the ANIP¹. The transaction may either (a) be
executed successfully, in which case a success RETSTAT is returned, (b)
fail in its execution at the slave, and then an execution error RETSTAT is
returned, or (c) fail because of buffer overflow in a connection with no flow
control, and then an overflow error is reported.

In our network, routers do not drop data [20]; therefore, messages are
always guaranteed to be delivered at the NI. For connections with flow
control, NIs do not drop data either. Thus, message delivery to the IPs is
guaranteed automatically in this case.

However, if there is no flow control, messages may be dropped at the
network interface in case of buffer overflow (see the paragraph on
end-to-end flow control below). All of CMD, OUTDATA, and RETDATA may be
dropped at the NI. To guarantee transaction completion, RETSTAT is not
allowed to be dropped. Consequently, in the ANIPs enough buffer space
must be provided to accommodate RETSTAT messages for all outstanding
transactions. This is enforced by bounding the number of outstanding
transactions.



Transaction Ordering. In this section, we describe the ordering
requirements between different transactions within a single connection. Over
different connections no ordering of transactions is defined at the transport
layer.



¹ We assume that when data is received as a response (RETDATA), a RETSTAT
(possibly implicit) is also received to validate the data.

There are several points in a connection where the order of transactions can
be observed (see Figure 8): (a) the order in which the master presents CMD
messages to the ANIP, (b) the order in which the CMDs are delivered to the
slave by the PNIP, (c) the order in which the slave presents the responses
to the PNIP, and (d) the order the responses are delivered to the master by
the ANIP. Note that not all of (b), (c), and (d) are always present.
Moreover, there are no assumptions about the order in which the slaves execute
transactions; we can only observe the order of the responses. We consider
the order of the transaction execution to be a system decision, and not a
part of the interconnect protocol.

At both ANIP and PNIPs, outgoing messages belonging to different
transactions on the same connection are allowed to be interleaved. For
example, two write commands can be issued, and only afterwards their
data follows. If the order of OUTDATA messages differs from the order
of CMD messages, transaction identifiers must be introduced to associate
OUTDATAs with their corresponding CMD.

Outgoing messages can be delivered by the PNIPs to the slaves (see
Figure 8-b) as follows:

• Unordered, which imposes no order on the delivery of the
outgoing messages of different transactions at the PNIPs.

• Ordered locally, where transactions must be delivered to each
PNIP in the order they were sent, but no order is imposed across
PNIPs. Locally-ordered delivery of the outgoing messages can be
provided either by an ordered data transportation, or by reordering
of outgoing messages at the PNIP.

• Ordered globally, where transactions must be delivered in the
order they were sent, across all PNIPs of the connection.
Globally-ordered delivery of the outgoing part of transactions requires a
costly synchronization mechanism.

Transaction response messages can be delivered by the slaves to the
PNIPs (see Figure 8-c) as follows:

• Ordered, when RETDATA and RETSTAT messages are returned in
the same order as the CMDs were delivered to the slave.

• Unordered, otherwise.

When responses are unordered, there has to be a mechanism to identify
the transaction to which a response belongs. This is usually done using
tags attached to messages for transaction identifications (similar to tags in
VCI).

Response messages can be delivered by the ANIP to the master (see
Figure 8-d) as follows:

• Unordered, which imposes no order on the delivery of responses.
Here, also, tags must be used to associate responses with their
corresponding CMDs.

• Ordered locally, where RETDATA and RETSTAT messages of
transactions for a single slave are delivered in the order the original
CMDs were presented by the master to the ANIP. Note that there is
no ordering imposed for transactions to different slaves within the
same connection.

• Globally ordered, where all responses in a connection are
delivered to the master in the same order as the original CMDs. When
transactions are pipelined on a connection, then globally-ordered
delivery of responses requires reordering at the ANIP.

All 3 x 2 x 3 = 18 combinations between the above orderings are
possible. Out of these, we define and offer the following two. An unordered
connection is a connection in which no ordering is assumed in any part
of the transactions. As a result, the responses must be tagged to be able
to identify to which transaction they belong. Implementing unordered
connections has low cost; however, they may be harder to use, and introduce
the overhead of tagging.
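
The count of 18 can be reproduced by enumerating the three ordering choices
listed above. The tuple names and string labels below are illustrative only.

```python
from itertools import product

# Delivery of outgoing messages by the PNIPs to the slaves (Figure 8-b).
OUTGOING_AT_PNIP = ("unordered", "ordered locally", "ordered globally")
# Delivery of responses by the slaves to the PNIPs (Figure 8-c).
RESPONSE_AT_PNIP = ("ordered", "unordered")
# Delivery of responses by the ANIP to the master (Figure 8-d).
RESPONSE_AT_ANIP = ("unordered", "ordered locally", "ordered globally")

combinations = list(product(OUTGOING_AT_PNIP, RESPONSE_AT_PNIP, RESPONSE_AT_ANIP))

# The unordered connection is the combination with no ordering at any point.
UNORDERED_CONNECTION = ("unordered", "unordered", "unordered")
```

Here len(combinations) equals 3 x 2 x 3, and UNORDERED_CONNECTION is one
element of the enumerated set.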

An ordered connection is defined as a connection with local ordering
for the outgoing messages from PNIPs to slaves (Figure 8-b), ordered
responses at the PNIPs (Figure 8-c), and global ordering for responses at the
ANIP (Figure 8-d). We choose local ordering for the outgoing part because
global ordering has a too high cost, and has few uses. The ordering of
responses is selected to allow a simple programming model with no
tagging. Global ordering at the ANIP is possible at a moderate cost, because
all the ordering is done locally in the ANIP.

A user can emulate connections with global ordering at the PNIPs using
non-pipelined acknowledged transactions.

Connection latency, throughput, and jitter. In our network,
throughput can be reserved for connections in a time-division multiple
access (TDMA) fashion, where bandwidth is split in fixed-size slots on a
fixed time frame. Bandwidth, as well as bounds on latency and jitter, can
be guaranteed when slots are reserved. They are all defined in multiples of
the slots.
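
How guarantees follow from slot reservations can be illustrated with a small
calculation. The slot table representation, the function name
tdma_guarantees, and the use of the longest run of foreign slots as an
indication of waiting time are assumptions of this sketch, not the paper's
definitions.

```python
def tdma_guarantees(slot_table, connection, link_bandwidth):
    """Derive simple figures for one connection from a TDMA slot table.

    `slot_table` lists the owner of each slot in the time frame (None means
    unreserved). Returns (guaranteed_bandwidth, max_gap_in_slots).
    """
    n = len(slot_table)
    owned = [i for i, owner in enumerate(slot_table) if owner == connection]
    if not owned:
        return 0.0, None
    # Bandwidth is the reserved fraction of the frame.
    bandwidth = link_bandwidth * len(owned) / n
    # Longest run of consecutive slots not owned by the connection, counted
    # cyclically because the frame repeats.
    gaps = [(owned[(k + 1) % len(owned)] - owned[k] - 1) % n
            for k in range(len(owned))]
    return bandwidth, max(gaps)
```

Reserving more slots raises the bandwidth figure, while spreading the
reserved slots evenly over the frame shortens the longest gap.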

Guaranteed-throughput connections can overbook resources in some
cases. For example, when an ANIP opens a guaranteed-throughput read
connection, it must reserve slots for the read command messages, and for
the read data messages. The ratio between the two can be very large (e.g.,
1:100), which leads either to a large number of slots, or bandwidth being
wasted for the read command messages.

To solve this problem, we allow the request and response parts of a
connection to be configured independently for all of throughput, latency and
jitter. Consequently, the request part of a connection can be best effort,
while the response can have guaranteed throughput (or vice versa). For
the example mentioned above, we can use best-effort read messages, and
guaranteed-throughput read-data messages. No global connection
guarantees can be offered in this case, but the overall throughput can be
higher and more stable than in the case of using only best-effort traffic.

Connection flow control. As mentioned earlier, our network
guarantees that messages are delivered to the NI. Messages sent from one of
the NIPs are not immediately visible at the other NIP, because of the
multi-hop nature of networks. Consequently, handshakes over a network would
allow only a single message to be transmitted at a time. This limits the
throughput on a connection and adds latency to transactions. To solve this
problem, and achieve a better network utilization, the messages must be
pipelined. In this case, if the data is not consumed at the PNIP at the same
rate it arrives, either flow control must be introduced to slow down the
producer, or data may be lost because of limited buffer space at the
consumer NI.

We introduce end-to-end flow control at the level of connections, which
requires buffer space to be associated with connections. End-to-end flow
control ensures that messages are sent over the network only when there is
enough space in the NIP's destination buffer to accommodate them.

End-to-end flow control is optional (i.e., to be requested when the
connection is opened) and can be configured independently for the outgoing
and return paths. When no flow control is provided, messages are dropped
when buffers overflow. Multiple policies of dropping messages are
possible, as in off-chip networks. Possible scenarios include: (a) the
oldest message is dropped (milk policy), or (b) the newest message is
dropped (wine policy) [24].
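
The two dropping policies can be sketched in a few lines of Python. The
buffer is modelled as a bounded queue; the function name, the policy
strings, and the message values are assumptions chosen for illustration
only.

```python
from collections import deque
from typing import Any, Deque, Optional


def enqueue(buffer: Deque[Any], capacity: int, message: Any,
            policy: str) -> Optional[Any]:
    """Store a message in a bounded buffer without flow control.

    Returns the message that was dropped, or None if nothing was dropped.
    policy "milk": on overflow the oldest buffered message is dropped.
    policy "wine": on overflow the newest (arriving) message is dropped."""
    if len(buffer) < capacity:
        buffer.append(message)
        return None
    if policy == "milk":
        dropped = buffer.popleft()   # discard the oldest message
        buffer.append(message)
        return dropped
    if policy == "wine":
        return message               # discard the arriving message
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy in ("milk", "wine"):
        buf: Deque[int] = deque()
        dropped = [enqueue(buf, 2, m, policy) for m in (1, 2, 3)]
        print(policy, list(buf), dropped)
```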

We opt for a credit-based flow control. Credits are associated with the
empty buffer space at the receiver NI. The sender's credit is lowered as
data is sent. When data is delivered at the receiver NIP, credits are
granted to the sender. If the sender's credit is not sufficient to send
some data, the NI at the sender stalls the sending.
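
A minimal sketch of the credit mechanism, written in Python, is given
below. The class name, the method names, and the choice of counting
credits in words are assumptions for this sketch; how credits travel back
over the network is not modelled.

```python
class CreditedSender:
    """Sender-side view of credit-based end-to-end flow control."""

    def __init__(self, receiver_buffer_words: int) -> None:
        # Initially the whole receiver buffer is empty, so the sender
        # holds one credit per word of buffer space.
        self.credits = receiver_buffer_words

    def try_send(self, words: int) -> bool:
        """Send if enough credit is available; otherwise stall (return False)."""
        if words > self.credits:
            return False
        self.credits -= words        # credit is lowered as data is sent
        return True

    def grant(self, words: int) -> None:
        """Credits returned by the receiver once buffer space is free again."""
        self.credits += words


if __name__ == "__main__":
    sender = CreditedSender(receiver_buffer_words=4)
    print(sender.try_send(3))   # fits in the receiver buffer
    print(sender.try_send(2))   # stalls: only one credit left
    sender.grant(3)             # receiver delivered three words
    print(sender.try_send(2))   # now the data can be sent
```

Because the sender never transmits more than its credit, the receiver
buffer cannot overflow and no message needs to be dropped.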



C. Use Cases

To illustrate the need for differentiated services on connections, we
show in this section some examples of traffic. We describe the properties
they would use over an Æthereal connection to meet their traffic
requirements.

Video processing streams typically require a lossless, in-order video
stream with guaranteed throughput, but possibly allow corrupted samples.
An Æthereal connection for such a stream would require the necessary
throughput, ordered transactions, and flow control. If the video stream is
produced by the master, only write transactions are necessary. In such a
case, with a flow-controlled connection there is no need to also require
transaction completion, because messages are never dropped, and the write
command and its data are always delivered at the destination. Data
integrity is always provided by our network, even though it may not be
necessary in this case.

Another example is that of cache updates, which require uncorrupted,
lossless, low-latency data transfer, but for which ordering and guaranteed
throughput are less important. In such a case, a connection would not
require any time-related guarantees, because a low latency, even if
preferable, is not critical. Low latency can be obtained even with a
best-effort connection. The connection would also require flow control and
guaranteed transaction completion to ensure lossless transactions.
However, no ordering is necessary, because this is not important for cache
updates, and allowing out-of-order transactions can reduce the response
time.



V. Conclusions

In this paper, we compare networks on chip (NoC) to off-chip networks
(e.g., computer networks) and existing on-chip interconnects (e.g.,
busses). We show that NoCs have many similarities with off-chip networks.
However, they also differ, especially in their resource constraints. For
example, on a chip, memory and computation resources are more expensive,
while there are more wires. This makes NoC architectures different from
off-chip networks, and requires rethinking of network services.

We also compare NoCs to existing on-chip interconnects, such as buses
and switches. By directly connecting IP blocks, existing on-chip
interconnects can offer tight coupling between masters and slaves, and
global arbitration. In NoCs, masters and slaves are completely decoupled,
and the arbitration is distributed over the network nodes. This makes it
harder to provide guarantees, such as bandwidth lower bounds, and
transaction ordering.

We define a set of NoC services that abstract from the network details.
Using these services in the IP design decouples computation and
communication. We use a request-response transaction model to be close to
existing on-chip interconnect protocols. This eases the migration of
current IPs to NoCs. To fully utilize the NoC capabilities, such as high
bandwidth and transaction concurrency, our services provide
connection-oriented communication. Connections can be configured
independently with different properties. These properties include
transaction completion, various transaction orderings, bandwidth lower
bounds, latency and jitter upper bounds, and flow control.

Our services are a prerequisite for service-based system design, which
makes applications independent of NoC implementations, makes designs
more robust, and enables architecture-independent quality-of-service
strategies.

Abstract

Continuing VLSI technology scaling raises several deep
submicron (DSM) problems like relatively slow interconnect,
power dissipation and distribution, and signal integrity.
Those problems are encountered particularly on long wires
for global interconnect. As clock frequencies increase,
scaled wires become relatively slower, and on-chip
communication will be the limiting performance factor of
future chips. We explain why efficient sharing of the wires
for long-distance communication is the solution to this
problem. We introduce networks on silicon (NoS), that route
packets over shared (semi-)global wires. NoS performance
is expected to be high, but comes at a cost. Balancing the
performance and cost of a NoS is a major challenge, and
we believe busses still have a role to play.



1 Technology trend 

VLSI technology scaling has long followed Moore's law.
No fundamental barriers have been identified that invalidate
this law for at least another decade [12]. Moore's law
predicts that chips in 2010 will count over 4 billion
transistors, operating in the multi-GHz range. This abundance of
transistors will make very complex systems on silicon (SoS)
possible.


However, challenges at all abstraction levels of design
will have to be addressed before such SoSs will become a
reality. The three most important deep submicron (DSM)
challenges, related to all abstraction levels, are: substantial
wire delay, controlling power delivery and dissipation, and
assuring signal integrity.

Until recently, on-chip wiring was cheap. Consequently,
architectural models have been employed that relied on
low-latency communication to globally share expensive
computational resources. Global wire delay stays at best constant
under technology scaling and hence these wires become
effectively slower compared to a gate delay. For example,
for 130 nm technology the reachable distance of a repeated
global signal in a clock cycle is no more than the length of a
chip [4]. For 50 nm technology, crossing a chip with highly
optimized interconnect takes between six and ten clock
cycles, clearly invalidating the low-latency assumption of
today. Hence we must move to system-level architectures
that scale with technology.

Figure 1. The number of 50k blocks for future
process technologies.

A feasible template for a future-proof architecture is
constructed from processing nodes that do not grow in
complexity with technology. Instead, as technology scales, the
number of these processing nodes on the chip grows. An
on-chip communication network then combines these nodes
into a SoS.

Various publications show that the spanning wires in
blocks of 50k gates scale with technology [4, 13]. This
means that the aforementioned DSM issues can be handled
by CAD tools, assuming their evolutionary improvement.
Figure 1 shows the exponentially increasing amount of such
50k blocks for a large die in subsequent technologies; in
35 nm this number is approximately ten thousand (adapted
from [13] and [4]). It remains to find a communication
architecture that allows a SoS composed of these blocks to
cooperate efficiently.

2 Networks on silicon are inevitable 

Given the growing demand for and impact of interconnect
on system cost and performance, it is worthwhile to
optimize the utilization of wires. Ad-hoc global wiring
structures often lead to a huge number of wires with an
average usage as low as 10% in time [2]. To control cost in
this scenario, the wire packing density must be very high,
which is not beneficial for the power and delay
characteristics. Efficient mechanisms for sharing (semi-)global wires
must solve this cost-performance dilemma.

In deep submicron technologies, (semi-)global wires
need special attention for power, signal-integrity, and
performance reasons. In the discussion below we show how
special circuit techniques can handle these issues. Such
techniques only work, however, when embedded in
dedicated communication IP, which provides a more abstract
interface.

Power is an issue for global interconnect because it costs
more energy to send a bit of information over longer
wires. To reduce the communication delay, the energy
consumption increases due to bigger drivers. Employing
low-swing signaling for the global wires saves up to a factor four
in power for these wires [15]. Implementing low-swing
signaling requires special circuit techniques.

Signal integrity is hampered increasingly by growing
capacitive and inductive coupling between wires. Capacitive
noise coupling is the result of the large aspect ratio of wires
in DSM technologies. Inductive noise coupling becomes
more of a problem due to the decreasing transition times. IR
drop (footnote 1) in the supply distribution increasingly
contributes to the noise. The most effective way to make a
connection robust against noise is application of differential
signaling [7]. Differential signaling improves both the
generation of and sensitivity to noise.

The signal propagation delay of an uninterrupted wire
grows quadratically with its length; hence from a certain
length onwards it is advantageous to partition the wire in
segments with repeaters in between. The repeater insertion
technique improves bandwidth and latency but at the cost of
higher power consumption. Wire delay can be reduced by
fat wires with a lower resistance per unit length at the cost
of lower wire density. Such wires behave like lossy
transmission lines and require drivers with a resistance matched
to the transmission line.
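
To make the quadratic growth and the benefit of repeaters concrete, here
is a minimal Python sketch. It assumes a simple distributed-RC wire model
(delay of 0.5 * r * c * L^2) and a fixed delay per repeater; the model,
the function names, and the normalized parameter values are illustrative
assumptions, not figures from this paper.

```python
def unrepeated_delay(r: float, c: float, length: float) -> float:
    """Delay of an uninterrupted wire in the assumed distributed-RC model.

    r and c are resistance and capacitance per unit length, so the delay
    grows with the square of the length."""
    return 0.5 * r * c * length ** 2


def repeated_delay(r: float, c: float, length: float,
                   segments: int, repeater_delay: float) -> float:
    """Delay of the same wire split into equal segments, each followed by
    a repeater that adds a fixed (assumed) delay."""
    if segments < 1:
        raise ValueError("need at least one segment")
    per_segment = unrepeated_delay(r, c, length / segments) + repeater_delay
    return segments * per_segment


if __name__ == "__main__":
    # Hypothetical normalized values, chosen only to show the trend.
    for length in (10.0, 20.0):
        print(length,
              unrepeated_delay(1.0, 1.0, length),
              repeated_delay(1.0, 1.0, length, segments=5, repeater_delay=1.0))
```

In this model, doubling the length quadruples the delay of the
uninterrupted wire, whereas with a fixed segment length the repeated wire
grows only linearly in delay, at the price of the repeaters themselves.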

As a result, we believe that all inter-block communica- 
tion will be implemented by hard-macro transmitters and 
receivers, employing low-swing differential signaling, with 
well-controlled interconnect instead of ad-hoc drivers han- 
dled by standard place-and-route tools. In this way, commu- 
nication links can be realised with predictable performance 
and DSM robustness. 

Currently, the prevalent on-chip interconnects are
busses [1]. In a bus architecture, devices share a single
transmission medium to communicate. At a given time,
only one device has access to the shared medium. An
arbitration mechanism is required to order simultaneous
accesses. Such functionality is typically performed by a
centralized bus arbiter. The performance of a shared-medium
bus scales badly. For an increasing number of bus clients
(i) individual clients get less bandwidth on average, and (ii)
increased capacitive loads and wire length decrease the total
bandwidth.

Figure 2. Structural view of a network on silicon
consisting of processing nodes (P) and
nodes supporting communication (R, B).

A solution that pairs scalable communication performance
and minimal interconnect cost is expected from
networks on silicon (NoS) where the SoS is considered as a
network of components [2, 3, 1]. Figure 2 illustrates the
hardware architecture of this concept. The outer components
(marked P) exclusively perform processing and storage
functions, whereas the inner components (marked B and
R) form the NoS and cater to communication needs of the
outer components. The basic building blocks of a NoS are
routers (R).

A router forwards data from its input ports to its output
ports in a concurrent fashion. To that end, a router of
arity $N$ contains an $N \times N$ switch matrix. Data packets
make their way through the network based on the routing
information in their headers. A link between two routers is
implemented by a point-to-point connection. The links
typically span medium to long distances, ranging from several
to more than twenty millimeters. The actual length
depends on the chosen topology of the network. For a mesh
topology the links are relatively short; for a torus, which is
a mesh with wrap-around connections, some links have a
length of half the edge of the chip. Links can be optimized
for bandwidth, latency, power, or a combination of these,
depending on performance requirements.



1 Supply voltage drops are caused by high currents (I) flowing through
the resistance (R) of the supply network. Since the supply voltage reduces
under scaling, IR drop worsens.



3 NoS requirements 

An important characteristic of a future system-level
architecture is the separation between computation and
communication. A NoS allows the computational blocks to
communicate with one another via a uniform interface. A
uniform interface is advantageous because (i) it frees the
core developer from having to make assumptions about the
system in which the core will be used, and (ii) it does not
constrain the development of newer communication
architectures by detailed interfacing requirements of particular
legacy SoC components [6]. Several on-chip bus standards
are evolving to realize this goal, most notably VCI, put
forward by VSIA [14], and more recently, the Open Core
Protocol [10].

The fundamental aim of a NoS is to provide flexible and
efficient communication between the thousands of IP blocks
in a system, with performance guarantees. In a typical SoS,
the communication demands of different IP blocks show
large variations. For example, data rates may be constant
(e.g. digital video) or variable (e.g. compressed video). The
importance of latency and jitter also varies greatly. Finally,
the data granularity may range from single words to large
blocks. A NoS should be able to offer different services to
different clients. Each service class must be implemented
efficiently, using a shared uniform infrastructure.

A high utilization of the network comes at a price. When
the network starts to saturate, throughput and latency will
show huge variations, which is not acceptable in real-time
applications. Hence, the network should also provide
guarantees, like lossless data transport, minimal bandwidth,
and bounded latency. The way packets are buffered and
scheduled in routers, and the effects on performance
guarantees, has been the subject of intense research.
Fundamentally, sharing and guarantees are conflicting, and
efficiently combining guaranteed traffic with best-effort traffic
is hard [11]. Although best-effort services are cheaper than
guaranteed services, we believe that the latter are essential
because they enable compositional and scalable integration
of the IP blocks [5]. It is up to the IP integrator at design
time, and up to the application at run time, to make a trade
off.



4 Performance and cost analysis of NoSs 



The vision of previous sections is that the design of
future SoSs will allow IP blocks to be plugged in at will to
minimize communication costs, but without today's
problems like timing closure. In this section we investigate the
cost implications of system design based on a NoS. We hope
the vision comes at acceptable cost. We hope that the
overall cost of a NoS, including the full protocol stack to use it,
turns out to be acceptable, such that the integration blessings
of NoSs do not change into a cost nightmare.

4.1 Performance

The aggregate bandwidth of a router is the product of the
bandwidth per port, $BW_{port}$, the arity of the router (number
of ports), $N$, and a utilization factor, $\alpha \le 1$, corresponding
to the router arbitration scheme.

$$BW_{router} = \alpha \cdot N \cdot BW_{port} \qquad (1)$$

We discuss each in turn. The bandwidth per port is
determined by the bandwidth of the link and the router data path.
In short,

$$BW_{port} = B \cdot \min(BW_{wire}, BW_{data\text{-}path}) \qquad (2)$$

where $B$ is the width of the data path. The combined
bandwidth of the $B$ wires of a link is a function of the layout
characteristics (e.g. total length), chosen signaling
technique, and the budgets for power, delay, and area. A
first-order expression for the bandwidth of a repeated global wire
optimized for power-delay is

where FO4 is the delay of an inverter driving four equally
sized inverters [4]. In a 100 nm technology, this yields 5
Gb/s per wire under worst-case environmental conditions.
Notice that the bandwidth of repeated global wires scales
with technology because such wires allow (wave) pipelining
at the segments.

Running the router data path at 5 GHz is not feasible. An
aggressive but realistic frequency is 1.25 GHz, corresponding
to the clock frequency of 50k gates blocks [4]. The critical
function in the data path is the $N \times N$ switch. For $N$ up
to 20 it meets the 1.25 GHz data rate, using $N$ 1-out-of-$N$
multiplexors. The relaxed demand on the wires of the link
can be used to reduce power dissipation and area.

The utilization factor, $\alpha$, reflects the effectiveness of
the router to resolve contention on the links. The queuing
strategy, the queue sizes, and the schedule algorithm all
strongly influence $\alpha$. Accordingly, many queuing policies
and scheduling algorithms have been presented in the
literature. For example, $\alpha = 0.59$ for infinite fifo input queues
with uniform and independent traffic. (Virtual) output
queuing gives $\alpha = 1$ under the same conditions, but at the
cost of larger queues and a more complex scheduling
algorithm [8]. Static scheduling techniques like
(time-division-multiplexed) circuit switching can also improve the
utilization.

Hence, in 100 nm technology, the bandwidth of a 32 bit
router port is approximately 5 GByte/sec.
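
The arithmetic behind equations (1) and (2) and the 5 GByte/sec figure
can be checked with a short Python sketch. The function names are
assumptions made here, and the sketch assumes that each data-path wire
carries one bit per clock cycle, so that a 1.25 GHz data path corresponds
to 1.25 Gb/s per wire.

```python
def port_bandwidth(width_bits: int, wire_bw: float, datapath_bw: float) -> float:
    """Equation (2): B times the slower of the wire and the data path,
    both given in bits per second per wire."""
    return width_bits * min(wire_bw, datapath_bw)


def router_bandwidth(alpha: float, arity: int, port_bw: float) -> float:
    """Equation (1): utilization factor times arity times port bandwidth."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("utilization factor must lie between 0 and 1")
    return alpha * arity * port_bw


if __name__ == "__main__":
    # Values quoted in the text for a 100 nm technology.
    port = port_bandwidth(width_bits=32, wire_bw=5e9, datapath_bw=1.25e9)
    print(port / 8 / 1e9, "GByte/s per port")
    # Aggregate bandwidth of a router of arity 5 with input queuing.
    print(router_bandwidth(alpha=0.59, arity=5, port_bw=port) / 8 / 1e9,
          "GByte/s aggregate")
```

The data path, not the wire, is the limiting term of the minimum here,
which is why the text notes that the demand on the link wires is relaxed.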

4.2 Cost

Three main components contribute to the area cost of a
router: the switch, the control logic, and the packet queues.
The switch allows $N$ simultaneous connections from the
$N$ inputs to the $N$ outputs, which results in 3 arrays of
$N \times N$ wires, giving rise to an $O(N^2)$ area cost.

The control logic of a router is made up of the
switch-matrix schedule unit and other configuration logic. The
delay of a schedule cycle varies greatly per algorithm
(for example, for virtual output queuing from $O(1)$ to
$O(N^{5/2})$). It is important for two reasons. First, it
determines the lower bound for latency that a flit (footnote 2)
incurs to traverse the router. Second, it affects the size of the
queues. The longer a schedule cycle, the more data arrive, given a
fixed bandwidth of a port $BW_{port}$. This leads to deeper
queues, and higher area cost.

The three aforementioned queuing strategies require
queues of size $O(N)$ to $O(N^2)$ flits. Scheduling algorithms
perform better with deeper queues, with a decreasing return.

Besides routers, a significant amount of area is consumed
by so-called network interface (NI) modules. These
modules translate the IP transactions for a given connection to
packets that are sent over the network, and vice versa.
Packets can be sent once the payload has been completely
accepted by the NI. Hence, the buffers must be dimensioned
such that at least a complete packet for every
simultaneously active connection can be stored.
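
As a small illustration of this dimensioning rule, the Python sketch
below computes the smallest NI buffer that holds one complete packet per
active connection. The function name, the connection names, and the
packet sizes are hypothetical and serve only as an example.

```python
from typing import Dict


def min_ni_buffer_flits(packet_flits: Dict[str, int]) -> int:
    """Smallest buffer, in flits, that stores one complete packet for every
    simultaneously active connection.

    packet_flits maps a connection name to its packet size in flits."""
    if any(size <= 0 for size in packet_flits.values()):
        raise ValueError("packet sizes must be positive")
    return sum(packet_flits.values())


if __name__ == "__main__":
    # Hypothetical connections and packet sizes.
    connections = {"video": 8, "cache": 2, "dma": 4}
    print(min_ni_buffer_flits(connections))
```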

The trade off between utilization $\alpha$ and the cost is a
complex one, but of importance to the viability of NoSs.

5 The future role of busses

In Sections 1 and 2 we have argued that NoSs are
essential to solve SoS integration in a scalable fashion. While
Section 4.2 raised some general cost issues, we will now
more concretely consider the trade off between busses
and NoSs. Will packet-switched NoSs completely replace
current busses in future SoSs, or will a hybrid approach
emerge? We believe that shared busses may have a role
to play in first-level communication (B in Figure 2) for the
following reasons.

First, typical IP blocks underutilize the bandwidth
capacity of an individual router port. All router ports offer the
same bandwidth that is inherent to the architecture, whereas
the bandwidth requirements of IP blocks vary greatly. A
shared memory module typically needs much higher (peak)
bandwidth than a streaming peripheral device. Single word
transfers, variable bit rates, bursty IO, and much lower clock
rates for IP blocks than for the NoS further waste
bandwidth. This means that the communication needs of a
number of IP blocks can be aggregated using a bus before the
capacity of a network link is reached.

Second, network interfaces are more expensive (in terms
of area) than a bus adaptor. Using a bus as a first-level
traffic concentrator, trading bus adaptors for network interfaces,
thus reduces the overall cost of IP-NoS interfacing. We
expect that the overhead of a bus and its network interface is
outweighed.

Figure 3. A shared-medium bus seems a cost-effective
way to connect the IP to the packet-switched network.

2 Flit stands for flow control unit: the portion of data handled
per schedule cycle. A packet is decomposed in flits.

Finally, the number of routers is reduced significantly
when busses are used as the first-level interconnect. Routers
are larger than busses due to their packet queues and more
complex scheduling. We give an example below.

An example of the heterogeneous communication
architecture is depicted in Figure 3. A router of arity three
surrounded by twelve IP blocks is shown. Two shared-medium
busses, each connected to six 50k gates IP blocks,
communicate with the router via two network interfaces. These
have two functions: first they schedule the transactions on
the bus, and second they give the bus clients access to the
packet-switched network. The third port of the router
provides communication to the remainder of the network.
Figure 4 shows an architecture using only routers. Now three
routers of arity five and one of arity four are needed.

The suggested shared-medium bus has a length of 35kλ,
where λ is half of the length of a minimal transistor. Global
wires of this length will not be the bottleneck of bus
performance (footnote 3).

The feasibility of hybrid NoSs hinges on the right
implementation of the busses. First, they must be shared wires,
as opposed to switches. Second, their arbitration must be
combined with, or at least compatible with, the scheduling taking
place in the network interfaces, to offer uniform end-to-end
network services.

We see a future for hybrid NoSs, with first-level
communication over a shared-medium bus, and the higher levels
using a packet-switched network. Perhaps a packet-switched
network can be seen as a distributed and scalable
implementation of a logical bridge that connects all the local busses of
the SoS. Deciding how many IP blocks can use a local bus
before connecting to the router network is a question that
must be answered foremost.

3 Wire segments optimized to power-delay product have a length of
48kλ. These lengths scale with technology, like the edge of 50k blocks [4].

6 Conclusion 

We have argued in Section 1 that future systems on
silicon (SoS) will be composed of large numbers of
processing nodes (or IP blocks). Each processing node is
relatively small (50k gates) to scale with technology, and can
be handled by CAD tools, assuming their evolutionary
improvement. The interconnect and communication between
these blocks then becomes an essential function in itself
(Section 2), leading to networks on silicon (NoS). A NoS
is based on packet switching to flexibly share link capacity
between the network clients, and to provide pluriform
communication services over a uniform infrastructure. Both
efficiency, provided by best-effort traffic, and predictable
performance, such as guaranteed throughput and latency, are
important (Section 3). Efficiently combining them is a
challenge. Section 4 showed that the performance of a NoS
depends on many factors, but is expected to be high. The cost
of a NoS can be stated in terms of area (routers, network
interfaces), utilization of wires, and speed (latency). They can
be traded off against one another, but also, perhaps more
interestingly, against the cost of busses. A hybrid NoS using
shared-wire busses to communicate locally, and accumulating
traffic for a core router network, is a promising
architecture that deserves to be investigated.

ABSTRACT

Managing the complexity of designing chips containing billions of
transistors requires decoupling computation from communication.
For the communication, scalable and compositional interconnects
(such as networks on chip (NoC)) must be used. In this paper we
show that guaranteed services are essential in achieving this
decoupling. Guarantees typically come at the cost of inefficient
resource utilization. To achieve efficiency, they must be used in
combination with best-effort services. We describe a NoC architecture
that efficiently combines guaranteed and best-effort services. The
key element of our NoC is a router consisting conceptually of two
parts: the so-called guaranteed throughput (GT) and best-effort (BE)
routers. Both offer data integrity, lossless and in-order data
delivery. Additionally, the GT router offers guaranteed throughput and
latency services. We combine the GT and BE router architectures
efficiently by sharing router resources, enabling high link utilization.
The guarantees are never affected by the BE traffic, and links are
efficiently utilized because BE traffic uses all bandwidth left over
from GT traffic. Connections are programmed using BE packets.
The programming model is robust, concurrent, and distributed. It
enables run-time and compile-time, deterministic and adaptive
connection management. For all our architectural choices, we show the
trade offs between hardware complexity and efficiency, and
motivate our choices.

1. INTRODUCTION 

Recent advances in technology raise the challenge of managing
the complexity of designing chips containing billions of transistors.
A key ingredient in tackling this challenge is decoupling the
computation from communication [9, 15]. This decoupling allows IPs
(the computation part), and the interconnect (the communication
part) to be designed independently from each other.

In this paper, we focus on the communication part. Existing
interconnects (e.g., busses) may no longer be feasible for chips with
many IPs, because of the diverse and dynamic communication
requirements. Networks on a chip (NoC) are emerging as an
alternative to existing on-chip interconnects because they (a) structure
and manage global wires in new deep-submicron technologies [2,
3, 4, 6], (b) share wires, lowering their number and increasing their
utilization [4, 6], (c) can be energy efficient and reliable [2], and
(d) are scalable when compared to traditional busses [7].

Decoupling the computation from communication requires that
services that IPs use to communicate (a) are well-defined, and
(b) hide the implementation details of the interconnect [9], see
Figure 1(a). NoCs again help, because they are traditionally
designed using layered protocol stacks [14], where each layer
provides a well-defined interface which decouples service usage from
service implementation [13, 3].

In particular, guaranteed services are essential because they
make the requirements on the NoC explicit, thus limiting the
possible interactions (a stricter contract) of IPs with the communication
environment. As a result, IP design is simpler. IPs can also be
designed independently, because their guaranteed services are not
affected by the interconnect or by other IPs. This is essential for a
compositional construction (design and programming) of systems
on chip. Moreover, for guaranteed services, failures are restricted
to the IP configuration phase (a service request is either granted or
denied by the NoC), which simplifies the IP programming model [6].
We view the guaranteed services to be offered by an interconnect
as a requirement from the applications, see Figure 1(b).

The drawback of using guaranteed services is that they require
resource reservation for worst-case scenarios. As a consequence,
resources may not be efficiently utilized, which may not be
acceptable in a system on a chip where cost constraints are typically
very tight, see Figure 1(c). To overcome this problem, best-effort
services can be used for less critical communication requirements
to fully utilize the available resources. Using best-effort services,
however, provides no guarantees.

A compromise between using guarantees only and having an
efficient interconnect is to combine guaranteed and best-effort
services. Guaranteed traffic should not be affected by best-effort
traffic, while best-effort traffic may use all the resources not used by
guaranteed traffic. Guaranteed services would then be used for the
critical traffic requirements, and best-effort services for non-critical
traffic requirements.

Figure 1: Network services (a) hide the interconnect details and
allow reusable components to be built on top of them, (b) are
driven by the application requirements, (c) their efficiency
relies on technology and network organization, and (d) are built
using a layered approach.
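
The intended interplay between the two traffic classes on a shared link
can be sketched in Python as follows. The slot table, the queue layout,
and all names below are assumptions made for this sketch; the actual
router mechanism is the subject of later sections and may differ.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


def schedule_link(slot_table: List[Optional[str]],
                  gt_queues: Dict[str, Deque[str]],
                  be_queue: Deque[str],
                  num_slots: int) -> List[Optional[Tuple[str, str]]]:
    """Decide what a link carries in each slot.

    slot_table[i] names the guaranteed connection that owns slot i, or is
    None for an unreserved slot. A guaranteed flit always gets its own
    slot; best-effort flits use unreserved slots and reserved slots whose
    owner has nothing to send."""
    sent: List[Optional[Tuple[str, str]]] = []
    for t in range(num_slots):
        owner = slot_table[t % len(slot_table)]
        if owner is not None and gt_queues[owner]:
            sent.append(("GT", gt_queues[owner].popleft()))
        elif be_queue:
            sent.append(("BE", be_queue.popleft()))
        else:
            sent.append(None)        # link idle in this slot
    return sent


if __name__ == "__main__":
    table = ["A", None]                          # hypothetical 2-slot frame
    gt = {"A": deque(["a0"])}
    be = deque(["b0", "b1", "b2"])
    print(schedule_link(table, gt, be, num_slots=4))
```

Guaranteed traffic never waits for best-effort traffic in this sketch,
while best-effort traffic picks up every slot that guaranteed traffic
leaves unused.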

In this paper, we first list a set of network-independent
communication services that are essential in chip design. In the following
sections, we show the trade-offs between efficiency and cost that
we make in our NoC. In Section 3, we present the trade offs and
take decisions on network-related issues. In Section 4, we zoom
into the internals of the key component of our NoC: a router which
efficiently combines guaranteed and best-effort services.

2. SERVICES 

The increasing complexity of integrated circuits, and the strong
time-to-market pressure require modular designs and IP reuse.
Decoupling computation from communication in chip design serves
both these requirements [9]. This decoupling is realized by
defining communication interfaces that provide well-defined
services and hide the implementation details of the interconnect.

We show in Section 1 that guaranteed services are essential to
simplify IP design and integration. Examples of such guaranteed
services are data integrity, which assures that data is delivered
uncorrupted, lossless data delivery, which means no data is dropped
in the interconnect, and in-order data delivery, which specifies that
the order in which data is delivered is the same order in which it
has been sent. Other guarantees offer time-related bounds, such as
throughput and latency.

Guarantees require resource reservation for worst-case
scenarios, which can be expensive. For example, guaranteeing
throughput for a stream of data implies reserving bandwidth for its peak
throughput, even when its average is much lower. As a
consequence, when using guarantees, resources are often underutilized.

Resources are better utilized when best-effort traffic is used. 
Best-effort services do not reserve any resources, and hence provide 
no guarantees. As a consequence, their performance is dictated by 
boundary conditions, such as interconnect load. For example, a 
connection may become temporarily lossy in a congested network, 
if the network resolves congestion by dropping data.

Best-effort services use resources well because they are typically
designed for average-case scenarios as opposed to worst-case
scenarios. They are also easy and fast to use, as they require no
resource reservation. Their main disadvantage is their
unpredictability: one cannot rely on a given performance (i.e., they do
not offer guarantees). In the best case, if certain boundary conditions
are assumed, a statistical performance can be derived.

The requirements for guaranteed services and the efficiency
constraint (good resource utilization) are conflicting. But a first
step to a predictable and low-cost interconnect is combining the
guaranteed and best-effort services in the same interconnect.
Guaranteed services would be used for critical traffic requirements,
and best-effort services for non-critical traffic requirements. For
example, a video processing IP will typically require a lossless,
in-order video stream with guaranteed throughput, but possibly allows
corrupted samples. Another example is cache updates, which require
uncorrupted, lossless, low-latency data transfer, but ordering and
guaranteed throughput are less important. In Section 4.3 we show how
combining guaranteed and best-effort services efficiently uses common
resources. In the remainder of this section we analyze the minimum
level of abstraction at which the communication services must be
offered to hide the network internals.

Traditionally, network services have been implemented and offered
using a layered protocol stack, typically aligned to the ISO-OSI
reference model [14]. NoCs also take this approach [2, 3, 6, 15],
because it structures and decomposes the service implementation, and
the protocol stack concepts aid positioning of services.

To achieve the decoupling of computation from communication, the
communication services must be offered at least at the level of the
transport layer in the OSI reference model. It is the first layer that
offers end-to-end services, hiding the network details; see
Figure 1(a) [3].

The lowest three layers in the protocol stack, namely the physical,
data-link and network layers, are network specific. Therefore, these
services should not be visible to the IPs if decoupling of computation
from communication is desired. However, these layers are essential in
implementing our services, because constructing guarantees without
guarantees at the layer below is either very expensive, or even
impossible. For example, implementing a lossless communication on top
of a lossy service requires acknowledgment, data retransmission, and
filtering duplicated data. This leads to a significant increase in
traffic, and also a trade-off between large buffer space requirements
and long delays. Even worse, providing guarantees for time-related
services is impossible if lower layers do not offer these guarantees.
For example, throughput cannot be guaranteed if communication at a
lower layer is lossy. As a consequence, guarantees can only be built on
top of guarantees, see Figure 1(b). Similarly, a layer's efficiency is
based on efficient implementations of the layers below it, see
Figure 1(c).

The NoC services that we consider essential for chip design are:
data integrity, lossless data delivery, in-order delivery, throughput,
and latency. Data integrity is always guaranteed. All the other
services can be guaranteed or not, depending on request. In the next
section, we describe briefly how these services are provided by our
NoC, and in Section 4 we describe in detail how our router architecture
enables an efficient implementation of these services.

3. NETWORKS ON CHIP 

Currently, the prevalent on-chip interconnects are busses and
switches [ICQ. These are single-hop interconnects, meaning that there
is no storage in the interconnect itself. Scalable interconnects
require multiple hops with storage in every hop (router). This
introduces a number of new issues, which we discuss in this section.

General computer network research is a mature research field [16]
which has many issues in common with NoCs. However, two significant
differences between computer networks and on-chip networks make the
trade-offs in their design very different [Q. First, routers of a NoC
are more resource constrained than those in computer networks, in
particular in the control complexity and in the amount of memory.
Second, communication links of a NoC are relatively shorter than those
in computer networks, allowing tight synchronization (network flow
control) between routers.

These two characteristics have a direct impact on the NoC service
implementation. In a NoC it is possible to solve the data integrity at
the data-link layer at a low cost. We, therefore, assume it solved at
the network layer and higher. Lossless transport of data is guaranteed
by our routers. However, to allow consumers slower than producers, the
network may be allowed to drop data at its edge. Consequently, the
designer may choose either (a) a lossless connection (i.e.,
implementing end-to-end flow control), or (b) a lossy connection (i.e.,
without flow control). In-order delivery is again guaranteed by our
router (i.e., routers do not reorder data between a given input port
and a given output port). End-to-end ordering of data, however, has to
be provided on top of this at the network edge when data is transported
on different routes with different delays. Offering guaranteed and
best-effort throughput and latency services is also implemented by the
routers. These router services together with the programming model
explained in Section 4.3.2 offer network throughput and latency
services.

We identify four important issues in the design of the router
network architecture. These are: the switching mode, routing,
contention resolution, and network flow control. Equally important,
end-to-end flow control and congestion control are handled in our NoC
at the network edge instead of the routers; we therefore omit their
discussion here.

3.1 Switching Mode 

The switching mode of a network specifies how data and control are
related. We distinguish circuit switching and packet switching.

In circuit switching data and control are separated. Control is
provided to the network (connection set up). This results in a circuit
over which all subsequent data of the connection is transported. In
time-division switching bandwidth is shared by time-division
multiplexing connections over circuits. Circuit-switched networks
inherently offer time-related guaranteed services when resources are
reserved during the connection setup.

In packet switching data is divided into packets, and every packet
is composed of a control part (the header) and a data part (the
payload). Network routers inspect, and possibly modify, the headers of
incoming packets to switch the packet to the appropriate output port.
Since in packet switching the packets are self-contained, there is no
need for a set-up phase to allocate resources. Best-effort services are
therefore naturally provided by packet switching.

3.2 Routing 

Routing is the determination of the route (or path) that the data
follows from source to destination. There are two basic approaches:
source routing and destination routing. In source routing, the network
interface at the source computes the complete route to the
destination. In destination routing, only the network address of the
destination is specified, and every router selects the appropriate
output, based on the address. We refer to [17] for several classes of
routing functions.
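
The difference between the two approaches can be sketched as follows.
This is only an illustration under stated assumptions: the three-router
line topology, the port names, and the names ROUTES, NEXT_HOP,
source_route and destination_route are hypothetical and do not come
from the text.

```python
# Hypothetical line topology: router 0 -- router 1 -- router 2.
# Port names ("east", "local") are illustrative assumptions.

# Source routing: the network interface at the source holds the full route.
ROUTES = {(0, 2): ["east", "east", "local"]}

# Destination routing: every router maps a destination address to an output.
NEXT_HOP = {
    0: {2: "east"},
    1: {2: "east"},
    2: {2: "local"},
}


def source_route(src, dst):
    """Return the complete list of outputs, computed once at the source."""
    return list(ROUTES[(src, dst)])


def destination_route(src, dst):
    """Let each router on the way select its own output from the address."""
    outputs, router = [], src
    while True:
        out = NEXT_HOP[router][dst]
        outputs.append(out)
        if out == "local":
            return outputs
        router += 1  # "east" leads to the next router in this line topology
```

Both functions yield the same sequence of outputs here; the difference
lies in where the routing knowledge is kept.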

In circuit switching, routing takes place at connection setup, i.e.,
once for all data of that connection. In packet switching, routing is
done for every individual packet sent over the network. In both cases,
source and destination routing are possible. We currently consider
source routing because it is independent of the router network
topology, which is not yet determined.

3.3 Contention Resolution

When a router attempts to send multiple data items over the same
link at the same time, contention is said to occur. As only one data
item can occupy a link at any point in time, a selection among the
contending data must be made; this process is called contention
resolution. Three approaches exist: avoiding contention, dropping data
(one of the contending data items is selected, the remainder are
deleted), and scheduling (or sequentializing) data (all data items are
sent in turn; some data items are therefore delayed).

In circuit switching contention resolution takes place at set up at
the granularity of connections, so that data sent over different
connections do not conflict. Thus, there is no contention during data
transport, and time-related guarantees can be given.

In packet switching contention resolution takes place at the
granularity of individual packets. Dropping packets is possible, but
for a lossless service (a) it adds complexity to the network
(acknowledgments, retransmission, etc.), and (b) it ultimately
increases the traffic because dropped packets need to be resent. Thus,
scheduling data is the only remaining option.

3.4 Network Flow Control

Network flow control, also called routing mode, deals with the
limited amount of buffering in routers and data acceptance between
routers. In circuit switching connections are set up. The data sent
over these connections is always accepted by the routers and hence no
network flow control is needed. In packet switching, data must be
buffered at every router before it is sent on. Because routers have a
limited amount of buffering they accept data only when they have enough
space to store the incoming data.

There are three types of flow control, namely store-and-forward,
virtual cut-through, and wormhole routing. In store-and-forward
routing, an input packet is received and stored in its entirety before
it is forwarded to the next router. This requires storage for the
complete packet, and implies a per-router latency of at least the time
required for the router to receive the packet.

In virtual cut-through routing a packet is forwarded as soon as the
next router guarantees that the complete packet will be accepted. Only
when no guarantee is given, the whole packet is stored in the router.
Thus, virtual cut-through routing requires buffer space for a complete
packet, like store-and-forward routing, but allows lower-latency
communication.

In wormhole routing packets are split in so-called flits (flow
control digits). A flit is passed to the next router when that router
accepts that flit, even when there is not enough buffer space for the
complete packet. As soon as a flit of a packet is sent over an output
port, that output port is reserved for flits of that packet only. When
the first flit of a packet is blocked the trailing flits can therefore
be spread over multiple routers, blocking the intermediate links.
Wormhole routing requires the least buffering (buffer flits instead of
packets) and also allows low-latency communication. However, it is more
sensitive to deadlock and generally results in lower link utilization
than virtual cut-through routing.

We opt for wormhole routing because it offers low latency, which is
one of our targeted services, and because it has the lowest cost in
terms of buffering, which is expensive on-chip.

4. A COMBINED GT-BE ROUTER

Section 2 defines our requirements for NoCs in terms of services
that are to be offered, in particular both guaranteed and best-effort
services. The previous section introduces a number of general
networking issues that will be built upon here. In the following two
subsections we show that the guaranteed and best-effort services can
conceptually be described by two independent router architectures. The
combination of these two router architectures is efficient and has a
flexible programming model, as described in Subsection 4.3.

4.1 A GT Router Architecture

Our guaranteed-throughput (GT) router must guarantee uncorrupted,
lossless and ordered data transfer, and both throughput and latency
over a finite time interval. As mentioned earlier, data integrity is
solved at the data-link layer; we do not address it further. No data is
dropped by the GT router because we use a variant of circuit switching
(described in the next section). Data is transported in fixed-size
blocks, further explained below. As only one block is stored per input
in the GT router, blocks remain ordered. We now turn to the more
challenging time-related guarantees, namely throughput and latency.

4.1.1 Time-related Guarantees

Latency is defined as the time a packet spends in the network.
Guaranteeing latency, therefore, means that a worst-case upper bound
must be given for this time. Here we define throughput for a given
producer-consumer pair as the amount of data transported by the network
over a finite, fixed time interval. Guaranteeing throughput means
giving a lower bound.
We observe that guaranteeing latency in a lossless router is
difficult because contention requires scheduling and hence delays.
Guaranteeing throughput is less problematic. Rate-based packet
switching (for an overview see [18]) offers guaranteed throughput over
a finite period, and hence a latency bound. This bound is very high,
however, and the cost of buffering is also high. Deadline-based packet
switching [13] offers preferential treatment for packets close to their
deadline. This allows differential latency guarantees (under certain
admissible traffic assumptions), but also at high buffer costs.

Circuit switching solves the contention at set up, so naturally
providing guaranteed latency and throughput. Circuits can be pipelined
to improve throughput [5], at the cost of additional buffering and
latency. Time-division multiplexing connections over pipelined circuits
additionally offers flexibility in bandwidth allocation. This requires
a notion of router synchronicity, which is possible because a NoC is
better controllable than a general network. We explain this variation
in more detail in the next subsection. The associated programming model
is described in Section 4.3.2.

4.1.2 Contention-free Routing

A router uses a slot table to (a) avoid contention on a link, (b)
divide up bandwidth per link, and (c) switch data to the correct
output. Every slot table has S fixed-size time slots (rows), and N
router outputs (columns). There is a logical notion of synchronicity:
all routers in the network are in the same slot. In a slot s at most
one block of data can be read/written per input/output port. In the
next slot (s+1)%S, the read blocks are written to their appropriate
output ports. Blocks thus propagate in a store-and-forward fashion. The
latency a block incurs per router is equal to the duration of a slot.
Bandwidth is guaranteed in multiples of block size per S slots.

The entries of the slot table map outputs to inputs for every slot:
R(s,o) = i. An entry is empty when there is no reservation for that
output in that slot. No contention arises because there is at most one
input per output. Sending a single input to multiple outputs
(multicast) is possible.

The slots reserved for a block along its path from source to
destination increase by one (modulo S). If slot s is reserved in a
router, slot (s + 1)%S must be reserved in the next router on the path.
The assignment of slots to connections in the network is an
optimization problem, and is described in Section 4.3.3. Section 4.3.2
explains how slots are reserved in the network, by means of best-effort
packets.
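
The slot table mechanism described above can be sketched in a few lines.
This is a minimal illustration, not the router's actual implementation:
the values of S and N, the nested-list representation of the table, and
the helper names empty_table, switch and path_is_consistent are all
assumptions made for the example.

```python
S = 4  # slots per table (assumed value)
N = 2  # inputs/outputs per router (assumed value)


def empty_table():
    # table[s][o] == i means: in slot s, output o is fed by input i.
    # None marks an empty entry (no reservation).
    return [[None] * N for _ in range(S)]


def switch(table, slot, blocks_in):
    """blocks_in[i] is the block read on input i in `slot` (or None).
    Returns the blocks written to each output in slot (slot + 1) % S."""
    blocks_out = []
    for o in range(N):
        i = table[slot][o]
        blocks_out.append(None if i is None else blocks_in[i])
    return blocks_out


def path_is_consistent(tables, hops, first_slot):
    """hops: list of (router, input, output) along a path. Checks that
    slot s at one router is followed by slot (s + 1) % S at the next."""
    slot = first_slot
    for router, inp, out in hops:
        if tables[router][slot][out] != inp:
            return False
        slot = (slot + 1) % S
    return True
```

Because every table entry names at most one input per output, two blocks
can never be switched to the same output in the same slot. Multicast
follows from letting two entries of the same slot name the same input.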

4.2 A BE Router Architecture

Best-effort (BE) traffic can have a better average performance than
offered by guaranteed services. This depends on boundary conditions,
such as network load, that are unpredictable. Best-effort services thus
fulfill our efficiency requirement, but without offering time-related
guarantees. This section describes an architecture for a best-effort
service with uncorrupted, lossless, in-order data transport.

The router efficiency is influenced by both its complexity and its
utilization. In Section 3 we have justified our choice for routing
(source routing) and network flow control (wormhole). Now we determine
the contention resolution scheme that is used. It has two components:
buffering and scheduling. Our router prototypes show that the buffering
costs dominate the cost of the router; the main trade-off in
Section 4.2.1 is therefore between buffer costs and link utilization,
which are both critical resources. For the chosen buffering strategy an
efficient scheduling algorithm is selected in Section 4.2.2, trading
off link utilization and schedule complexity.

4.2.1 Buffering Strategy

The buffering strategy determines the location of buffers inside the
router. We distinguish input queuing, output queuing, and virtual
output queuing. In the following, N is the number of inputs (equal to
the number of outputs) of the router. We believe that in a balanced
solution the rates at which routers and links operate are equal. Slower
routers require more buffering, and faster routers are not feasible as
links operate at high speed.

In input queuing there is a single queue per input, resulting in the
lowest buffer cost (N logical queues in N physical memories) of all
three approaches. However, due to the so-called head-of-line blocking,
for large N, network utilization saturates at 59% [8]. Therefore, input
queuing results in weak utilization of the links.

Output queuing can increase the link utilization to 100% by having N
queues at each output, or N^2 queues, with as many physical memories.
It is better to have fewer, larger memories than more, smaller memories
because the overhead of small RAMs is very high. Overclocking the
router by a factor N to use N memories is not possible, as argued
previously. So the number of memories depends quadratically on N, hence
output queuing is not scalable.

Virtual output queuing [1] (VOQ) combines the advantages of input
queuing and output queuing. It has the buffering complexity of input
queuing and the link utilization of output queuing. As for output
queuing, there are N^2 logical queues, but they are combined in N
physical memories at the inputs, as for input queuing. For every input
i there are N queues Q(i,o), one for each output o, see Figure 2. There
is at most one write to these queues. The difference between output
queuing and VOQ is the additional constraint that there can be at most
one read from this group of N queues. (This enables the mapping of all
input queues of the same input to one memory.) This additional
constraint has to be taken into account by the scheduling. 100% link
utilization can still be achieved when N is large [12].

We select VOQ because it combines high link utilization with 
moderate buffer costs. 

4.2.2 Matrix Scheduling

This section shows how link contention and memory contention
(imposed by VOQ) are resolved. Matrix scheduling solves both kinds of
contention by ensuring that every VOQ memory is read at most once, and
every output (link) is written to at most once. The scheduling problem
can be modeled as a bipartite graph matching problem as follows. Every
input port i is modeled by a node u_i and every output port o by a node
v_o. There is an edge between u_i and v_o if and only if queue Q(i,o)
is non-empty. A match is a subset of these edges such that every node
is incident to at most one edge. For example, Figure 3(c) is a match of
Figure 3(a). The number of edges in the match is its size; a match is
maximal when no edges can be added to it. A maximum size match is a
largest size match.
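
The distinction between a maximal match and a maximum size match can be
made concrete with a small sketch. The representation of edges as
(input, output) pairs and the function names are assumptions made for
illustration; the brute-force search is only meant for tiny graphs and
is not a practical matching algorithm.

```python
from itertools import combinations


def is_match(edges):
    """A match uses every input and every output at most once."""
    inputs = [i for i, _ in edges]
    outputs = [o for _, o in edges]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)


def is_maximal(match, graph):
    """Maximal: no remaining edge of the graph can be added to the match."""
    return is_match(match) and all(
        e in match or not is_match(match | {e}) for e in graph
    )


def maximum_size(graph):
    """Brute-force size of a maximum size match (small graphs only)."""
    for k in range(len(graph), 0, -1):
        if any(is_match(set(c)) for c in combinations(sorted(graph), k)):
            return k
    return 0
```

For the graph {(0,0), (0,1), (1,0)}, the single edge (0,0) is already a
maximal match, because both remaining edges share a node with it, yet
the maximum size match {(0,1), (1,0)} has two edges.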

Although optimal, there are two reasons not to consider only maximum
size matches. First, maximum size matching algorithms have O(N^(5/2))
complexity. Since matrix scheduling is done at flit rate this is not
feasible for large N. Second, maximum size matching algorithms can be
unfair, which can result in starvation, i.e., some queues are never
served.

Figure 2: Schematic of a router using virtual output queuing.

Figure 3: The three steps of a single iSLIP iteration.

There are several matching algorithms; see [11] for a thorough
discussion. We select the iterative SLIP (iSLIP) matrix scheduling
algorithm [11], because it has a low complexity, avoids starvation, and
provides increasing performance as the number of iterations grows. It
reaches a maximal match in log2(N) iterations. Even a single iteration
considerably outperforms input queuing, and can be efficiently
implemented in hardware. Multiple iterations increase the latency of
the control path, and hence the flit size (as explained in
Section 4.3.1). We consider using 1-SLIP because multiple iterations
give only marginal improvement.

A single iSLIP iteration has three steps, illustrated by an example
in Figure 3 for N = 4. In the first stage, see Figure 3(a), every
non-empty queue Q(i,o) requests access to output port o from input port
i. In the second stage, see Figure 3(b), every output port o grants one
request, solving link contention at the output ports. In the third
stage, see Figure 3(c), every input port accepts one grant, to resolve
memory contention at the input port. We extend iSLIP to take network
flow control into account.
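
The request, grant and accept stages can be sketched as below. This is
a simplified illustration under stated assumptions: the round-robin
grant and accept pointers follow the general description of iSLIP
in [11] rather than anything stated here, the function and parameter
names are hypothetical, and the extension for network flow control is
not modelled.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request-grant-accept round.
    requests[i][o] is True when queue Q(i,o) is non-empty.
    Returns the match as {input: output}; pointers are updated in place."""
    n = len(requests)

    # Grant: every output picks the first requesting input at or after
    # its round-robin pointer (resolves link contention).
    grants = {}  # input -> list of outputs that granted it
    for o in range(n):
        for k in range(n):
            i = (grant_ptr[o] + k) % n
            if requests[i][o]:
                grants.setdefault(i, []).append(o)
                break

    # Accept: every input picks the first granting output at or after
    # its round-robin pointer (resolves memory contention).
    match = {}
    for i, outs in grants.items():
        o = min(outs, key=lambda out: (out - accept_ptr[i]) % n)
        match[i] = o
        # Pointers advance only when a grant is accepted.
        grant_ptr[o] = (i + 1) % n
        accept_ptr[i] = (o + 1) % n
    return match
```

Called repeatedly with the same non-empty queues, the pointer updates
rotate service among the contending inputs, which is how starvation is
avoided in this sketch.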

4.3 Combining the GT and BE Routers

The GT and BE router architectures are combined to share resources,
in particular the links, memories, and switches. Moreover, best-effort
traffic enables a packet-based programming model for the guaranteed
traffic, as shown later, in Section 4.3.2.

The principal constraint for a combined router architecture is that
guaranteed services are never affected by best-effort services.
Figure 4(a) shows that, conceptually, the combined router contains both
router architectures (fat lines represent data transport, thin lines
represent control). Incoming data is handled by either the GT or the BE
router. The GT traffic (the traffic that is served by the GT router)
has the higher priority, to maintain guarantees. This is ensured by the
arbitration unit, which therefore affects the best-effort scheduling.
Furthermore, best-effort packets can program the guaranteed router, as
shown by the arrow labeled program. Thin lines going from the right to
the left indicate network flow control, which is only required for
best-effort packets because guaranteed blocks never encounter
contention.

On a shared link only one BE or GT data item can arrive or be sent
at any point in time. Thus [...] the number of [...] logical queues in
total. Figure 4(b) shows that the data path, consisting of memories and
switch matrix, is shared, and that the control paths of the BE and GT
routers are separate, yet interrelated. Moreover, the arbitration unit
of Figure 4(a) has been absorbed by the BE router. The following
subsection shows how this can be done.

4.3.1 Arbitration and Flit Size

When combining GT and BE traffic in a single network the impact on
the network flow control scheme must be taken into account. Recall from
Section 3.4 that a BE flit is the smallest unit on which flow control
is performed. In other words, the BE scheduling, using iSLIP, can only
react to GT blocks at flit granularity. To avoid alignment problems,
the block size (B words) is a multiple of the flit size (F words, with
B = CF). C is constant; we prefer a small C and F to decrease the
store-and-forward delay for guaranteed traffic.

Figure 4: Two views of the combined GT-BE router.

We extend iSLIP to handle the combination of GT and BE traffic. In
this combination GT traffic always has priority over BE traffic. This
is to ensure that guarantees are never corrupted.



4.3.2 Programming Model 

In this section we show how GT connections are set up and torn down
by means of BE packets. To ensure scalability, programming must not
require global or centralized resources. Section 4.1.2 explains why our
contention-free routing uses slot tables; we now see that they are
distributed over routers for scalability.

Initially the slot table of every router is empty. This means that
GT connections can only be set up using BE packets, unless an
additional communication infrastructure is introduced solely for
programming. Two special packets, Reset and Start, are used to reset
and start the NoC, respectively. They progress by flooding, and are not
subject to the usual network flow control. We will not discuss them
further. There are three system packets: SetUp, TearDown, and
AckSetUp. They are used to program the slot table in every router on
their path.

The SetUp packet is used to create a connection from a source to a
destination, and travels in the direction of the data ("downstream").
AckSetUp acknowledges a successful set up, and flows upstream. The
TearDown packet destroys (partially) existing connections, and can
travel in either direction. SetUp packets contain the source of the
data, the path to their destination, and a slot number. Every router
along the path of the SetUp packet checks if the output to the next
router in the path is free in the slot indicated by the packet. If it
is free, the output is reserved in that slot, and the SetUp packet is
forwarded with an incremented (modulo S) slot. Otherwise, the SetUp
packet is discarded and a TearDown packet returns along the same path.
Thus every path must be reversible; this is the only assumption we make
about the network topology. These upstream TearDown packets free the
slot, and continue with a decremented slot. Downstream TearDown packets
work similarly, and remove existing connections. A connection is
successfully created when an AckSetUp is received, else a TearDown is
received.
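
The set-up protocol described above can be sketched sequentially as
follows. This is a simplified model under stated assumptions: packets
are processed one at a time rather than concurrently, slot tables are
plain dictionaries, and the names set_up, tear_down_upstream and the
value of S are hypothetical.

```python
S = 8  # slots per table (assumed value)


def set_up(tables, path, slot, conn_id):
    """path: list of (router, output) pairs from source to destination.
    tables[router] maps (slot, output) -> connection id.
    Returns "AckSetUp" on success and "TearDown" on failure."""
    for hop, (router, output) in enumerate(path):
        key = ((slot + hop) % S, output)
        if key in tables[router]:
            # Output already reserved in this slot: discard the SetUp and
            # send a TearDown back along the hops reserved so far.
            tear_down_upstream(tables, path[:hop], (slot + hop - 1) % S)
            return "TearDown"
        tables[router][key] = conn_id
    return "AckSetUp"


def tear_down_upstream(tables, reserved, last_slot):
    """Free the reserved hops, walking back with a decremented slot."""
    slot = last_slot
    for router, output in reversed(reserved):
        del tables[router][(slot, output)]
        slot = (slot - 1) % S
```

A failed set up leaves the slot tables exactly as they were before the
SetUp packet was sent, so the source is free to retry with another slot
or path.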

The programming model is pipelined and concurrent (multiple system
packets can be active in the network simultaneously, also from the same
source) and distributed (active in multiple routers). Given the
distributed nature of the programming model, ensuring consistency and
determinism is crucial. The outcome of programming may depend on the
execution order of system packets, but is always consistent. The next
section shows how to use the programming model.

4.3.3 Slot Allocation

This section explains ways to determine the slots specified in SetUp
packets. A slot allocation for a single connection requires that at
every router along the path, the required output is free in the
appropriate slot. Therefore, interference of SetUp packets of multiple
connections can be completely avoided if connections are set up with
conflict-free slots or paths. All execution orders of SetUp packets
then give the same result.

Computing an optimal slot allocation is complex and requires a
global network view. It can be used only for small problem instances.
To reduce computational cost, heuristics can be used, but this probably
leads to non-optimal solutions. Compile-time slot allocations from both
approaches can be recreated deterministically at run time, concurrently
and distributedly (because all SetUp packets are conflict-free).

At run time, a global view requires a centralized slot allocation.
This impairs scalability and slows down programming. Run-time
distributed slot allocation is scalable, but lacks a global view. This
typically results in suboptimal slot allocation. Moreover, SetUp
packets may interfere, making programming more involved, and perhaps
non-deterministic. However, dynamic connection management at high rates
will require distributed slot allocation. In a simple distributed
greedy algorithm, all sources repeatedly generate random slot numbers
for each set up until their connection


We conclude that our programming model allows both compile-time and
run-time slot allocation. Computational complexity, deterministic
results, and scalability can be balanced according to system
requirements.

5. CONCLUSIONS

Managing the complexity of designing chips containing billions of
transistors requires decoupling computation from communication. For
communication, networks on chip (NoC) are emerging as an alternative
for existing interconnects to solve technological, performance, and
scalability problems.

In this paper we show that guaranteed services are essential to
provide predictable interconnects that enable compositional system
design and integration. However, guarantees typically utilize resources
inefficiently. Best-effort services overcome this problem but provide
no guarantees. So, combining guaranteed and best-effort services allows
efficient resource utilization, while still providing guarantees for
critical traffic.

Time-related guarantees, such as throughput and latency, can only be
constructed on a NoC that intrinsically has these properties. We
therefore define a router-based NoC architecture that combines
guaranteed and best-effort services. Thus, the router architecture has
conceptually two parts: the guaranteed-throughput (GT) and best-effort
(BE) routers. Both offer data integrity, lossless data delivery, and
in-order data delivery. Additionally, the GT router offers guaranteed
throughput and latency services using pipelined circuit switching with
time-division multiplexing. This requires a notion of synchronicity: at
each time slot at most one block of data is communicated over a link.
The GT router has low latency and moderate memory requirements. The BE
router uses packet switching, wormhole routing, and virtual output
queuing with iSLIP. The BE router has low latency, high link
utilization, and moderate memory requirements.

We combine the GT and BE router architectures efficiently by sharing
router resources. The guarantees are never affected by the BE traffic,
and links are efficiently utilized because BE traffic uses all
bandwidth left over by GT traffic. Connections are programmed using BE
packets. The programming model is pipelined, concurrent, and
distributed. It enables run-time and compile-time, deterministic and
adaptive connection management.

For all our architecture choices, we show the trade-offs between
hardware complexity and efficiency, and motivate our choices.

In conclusion, we describe and motivate a combined guaranteed and
best-effort router, which is an essential component in a NoC that
fulfills our requirements by providing guaranteed services, and
satisfies the efficiency constraint by good resource utilization.

1. Integrated circuit comprising a plurality of modules, and a network arranged 
for transferring messages between the modules, wherein a message issued by a 
module comprises first information indicative for a location of an addressed module 
within the network, and second information indicative for a location within the 
addressed module, 

characterized in that the first and the second information are arranged as a single 
address from which the network determines which module is addressed, and from 
which the addressed module determines which of its locations is selected. 

2. Method for exchanging messages in an integrated circuit comprising a 
plurality of modules, the messages between the modules being exchanged via a 
network, wherein a message issued by a module comprises first information indicative 
for a location of an addressed module within the network, and second information 
indicative for a location within the addressed module, 

characterized in that the first and the second information are arranged as a single 
address from which the network determines which module is addressed, and from 
which the addressed module determines which of its locations is selected. 

3. Integrated circuit comprising a plurality of processing modules and a network 
arranged for providing at least one communication channel between a first and a second 
module, which communication channel supports transactions comprising outgoing 
messages from the first module to the second module and return messages from the 
second module to the first module, characterized in that the network manages the 
outgoing messages in a way different from the return messages. 

4. Method for exchanging messages in an integrated circuit comprising a 
plurality of modules, the messages between the modules being exchanged via a 
network, wherein a communication channel through the network supports transactions 
comprising outgoing messages from the first module to the second module and return 
messages from the second module to the first module, characterized in that the 
network manages the outgoing messages in a way different from the return messages. 

5. Integrated circuit according to claim 3, wherein the network has a first mode 
wherein a message is transferred within a guaranteed time interval, and a second 
mode wherein a message is transferred as fast as possible with the available resources, 
wherein the outgoing transaction is a read message, requesting the second module to 
send data to the first module, wherein the return transaction is the data generated by 
the second module upon this request, and wherein the outgoing transaction is 
transferred according to the second mode, and the return transaction is transferred 
according to the first mode. 
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6. Integrated circuit according to claim 3, wherein the network allows at least 
two of the following transaction modes: unordered, locally ordered and globally 
ordered, wherein an unordered transaction mode of the network gives no guarantees 
for the order in which messages will arrive at their destination, a locally ordered 
transaction mode guarantees that messages sent to the same destination will arrive in 
the same order as they were sent, a globally ordered transaction mode guarantees that 
messages will arrive in the same order as they were sent even if they are sent to 
different destinations, wherein outgoing and return transactions are handled according 
to different transaction modes. 

7. Integrated circuit according to claim 3, wherein the network reserves a first 
and a second buffer space for the first and the second module respectively, the 
buffer spaces having a mutually different size. 

8. Integrated circuit comprising a plurality of modules, which modules are 
arranged to communicate to each other via a network, wherein the network is 
arranged to distribute a message from a first module to two or more second modules, 
and wherein the second modules are arranged to generate an acknowledge message 
indicating receipt of the message from the first module, 

the network being arranged to generate a single return message to the first module, in 
dependence of the acknowledge messages of the second modules. 

9. Integrated circuit according to claim 8, wherein the single return message 
indicates that at least one of the second modules has received the message issued by 
the first module. 

10. Integrated circuit according to claim 8, wherein the single return message 
indicates that each of the second modules has received the message issued by the first 
module. 

11. Integrated circuit comprising a first plurality of processing modules and a 
network, the network comprising a second plurality of nodes and interconnections 
between nodes, the network being arranged for transferring messages between a first 
and a second module via a path through the network, the processing modules coupled 
to the network via a network interface having a buffer for receiving incoming 
messages, wherein a message from a first to a second module is not initiated until the 
buffer has sufficient space for receiving a return message from the second module. 
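
As an illustration of the single-address scheme of claims 1 and 2, the sketch below packs the first information (which module) and the second information (which location within that module) into one address. It is only one possible encoding under stated assumptions: the claims do not prescribe any bit layout, and the field widths `MODULE_BITS` and `OFFSET_BITS` as well as the function names are hypothetical.

```python
MODULE_BITS = 4    # assumed width of the module-identifying field
OFFSET_BITS = 12   # assumed width of the location-within-module field


def make_address(module_id, offset):
    """Combine module id and in-module location into a single address."""
    if not 0 <= module_id < (1 << MODULE_BITS):
        raise ValueError("module_id out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (module_id << OFFSET_BITS) | offset


def network_route(address):
    """What the network derives from the address: the addressed module."""
    return address >> OFFSET_BITS


def module_select(address):
    """What the addressed module derives: the selected internal location."""
    return address & ((1 << OFFSET_BITS) - 1)


addr = make_address(module_id=3, offset=0x02C)
print(hex(addr), network_route(addr), hex(module_select(addr)))
```

The issuing module only supplies one address; the network and the addressed module each interpret the part of it that concerns them.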
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Abstract 

Networks are emerging as a possible solution for on-chip 
interconnects. In this paper, we describe how networks 
on chip (NoC) are similar to and differ from 
both off-chip networks (e.g., computer networks) and current 
on-chip interconnects (e.g., buses). We re-examine 
the communication services in the context of NoCs. We 
provide services that abstract from network implementations, 
enabling a clean separation between the NoC and 
IP blocks. We define a request-response transaction model 
similar to bus protocols, making our approach backward 
compatible. To exploit the full power of NoCs, 
we also provide connection-oriented communication with 
differentiated services. Examples are bandwidth guarantees, 
transaction orderings, and end-to-end flow control. 

Keywords: Networks on chip, on-chip buses, computer 
networks, communication services, protocol stack, 
transaction, connection. 
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